regular expressions and the LOCALE flag

MRAB python at mrabarnett.plus.com
Tue Aug 3 14:40:38 EDT 2010


Baz Walter wrote:
> the python docs say that re.LOCALE makes certain character classes 
> "dependent on the current locale".
> 
> here's what i currently see on my system:
> 
>  >>> import re, locale
>  >>> locale.getdefaultlocale()
> ('en_GB', 'UTF8')
>  >>> locale.getlocale()
> (None, None)
>  >>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
> [u'a', u'b', u'c']
>  >>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO 8859-1')
> 'en_GB.ISO 8859-1'
>  >>> re.findall(r'\w', u'\xe5 \xe6 \xe7 a b c', re.L)
> [u'\xe5', u'\xe6', u'\xe7', u'a', u'b', u'c']
>  >>> locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
> 'en_GB.UTF-8'
>  >>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
> [u'a', u'b', u'c']
> 
> it seems wrong to me that re.LOCALE fails to give the "right" result 
> when the local encoding is utf8 - i think it should give the same result 
> as re.UNICODE.
> 
> is this a bug, or does the documentation just need to be made clearer?

re.LOCALE just passes each character to the underlying C library's
character-classification functions. It really only works on bytestrings
that have 1 byte per character. UTF-8 encodes codepoints outside the
ASCII range as multiple bytes per codepoint, and the re module will
treat each of those bytes as a separate character.
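
A rough sketch of the byte-per-character point, assuming Python 3 (where
bytestrings are spelled b'...' and text strings are Unicode by default).
The variable names are just for illustration:

```python
import re

text = "\xe5"                          # one codepoint (a with ring above)
utf8_bytes = text.encode("utf-8")      # two bytes: b'\xc3\xa5'
latin1_bytes = text.encode("latin-1")  # one byte:  b'\xe5'

# A bytes pattern is matched one byte at a time, so the UTF-8 form
# looks like two separate "characters" to the re module.
utf8_units = re.findall(rb".", utf8_bytes, re.DOTALL)
latin1_units = re.findall(rb".", latin1_bytes, re.DOTALL)

print(utf8_units)    # [b'\xc3', b'\xa5']
print(latin1_units)  # [b'\xe5']
```

Neither half of the UTF-8 pair is a letter on its own, which is why a
per-byte locale lookup cannot classify such a character correctly.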

And, BTW, none of your examples pass a UTF-8 bytestring to re.findall:
all those string literals starting with the 'u' prefix are Unicode
strings!

Locale encodings are more trouble than they're worth. Unicode is better.
:-)
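
For what it's worth, here is a minimal sketch of the Unicode route,
again assuming Python 3. The sample bytes are made up to mirror the
strings in the question; the idea is to decode once and then match on
text, with no locale involved:

```python
import re

# UTF-8 encoded input, as it might arrive from a file or socket
# (hypothetical sample data mirroring the thread's example).
raw = b"a b c \xc3\xa5 \xc3\xa6 \xc3\xa7"
text = raw.decode("utf-8")   # decode once, at the boundary

# On a text string, \w follows Unicode rules. In Python 3 that is
# already the default for str patterns; re.UNICODE is spelled out
# only to mirror the flag named in the question.
words = re.findall(r"\w", text, re.UNICODE)
print(words)
```

With this input, all six letters come back, including the three
non-ASCII ones, regardless of what the current locale is set to.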


