regular expressions and the LOCALE flag

Baz Walter bazwal at ftml.net
Tue Aug 3 13:56:55 EDT 2010


the python docs say that re.LOCALE makes \w, \W, \b, \B, \s and \S
"dependent on the current locale".

here's what i currently see on my system:

 >>> import re, locale
 >>> locale.getdefaultlocale()
('en_GB', 'UTF8')
 >>> locale.getlocale()
(None, None)
 >>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']
 >>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO 8859-1')
'en_GB.ISO 8859-1'
 >>> re.findall(r'\w', u'\xe5 \xe6 \xe7 a b c', re.L)
[u'\xe5', u'\xe6', u'\xe7', u'a', u'b', u'c']
 >>> locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
'en_GB.UTF-8'
 >>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']

it seems wrong to me that re.LOCALE fails to give the "right" result
(matching \xe5, \xe6 and \xe7 as word characters) when the locale
encoding is utf-8. i think it should give the same result as re.UNICODE.
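
for comparison, here is a minimal sketch of the result i would expect, using re.UNICODE on the same string. the names text and unicode_matches are only for illustration, and it assumes the u'' literal is available (python 2, or python 3.3 and later).

```python
import re

# same test string as in the session above (illustrative name)
text = u'a b c \xe5 \xe6 \xe7'

# re.UNICODE makes \w follow the unicode character database,
# so the current locale should play no part in the result
unicode_matches = re.findall(r'\w', text, re.UNICODE)
print(unicode_matches)
```

this should return all six characters whatever locale is set, which is what i expected from re.LOCALE under a utf-8 locale.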

is this a bug, or does the documentation just need to be made clearer?


