regular expressions and the LOCALE flag

Baz Walter bazwal at
Wed Aug 4 01:27:55 CEST 2010

On 03/08/10 21:24, MRAB wrote:
>>> And, BTW, none of your examples pass a UTF-8 bytestring to re.findall:
>>> all those string literals starting with the 'u' prefix are Unicode
>>> strings!
>> not sure what you mean by this: if the string was encoded as utf8,
>> '\w' still wouldn't match any of the non-ascii characters.
> Strings with the 'u' prefix are Unicode strings, not bytestrings. They
> don't have an encoding.

well, they do if they are given one, as i suggested!

to be explicit, if the local encoding is 'utf8', none of the following 
will get a hit:

(1) re.findall(r'\w', '\xe5 \xe6 \xe7', re.L)
(2) re.findall(r'\w', u'\xe5 \xe6 \xe7'.encode('utf8'), re.L)
(3) re.findall(r'\w', u'\xe5 \xe6 \xe7', re.L)

so i still don't know what you meant about passing a 'UTF-8 bytestring' 
in your first comment :)

only (3) could feasibly get a hit - and then only if the re module was 
smart enough to fall back to re.UNICODE for utf8 (and any other 
encodings of unicode it might know about).
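
for comparison, here is a minimal sketch of the same three cases in python 3 syntax, leaving re.L out so the results don't depend on the machine's locale. the variable names are only illustrative, and this shows python 3 behaviour, not the python 2 behaviour discussed above:

```python
import re

text = '\xe5 \xe6 \xe7'       # three non-ascii letters separated by spaces
data = text.encode('utf8')    # the same text as a utf8 bytestring

# str pattern on a str: unicode matching is the default in python 3
str_hits = re.findall(r'\w', text)

# bytes pattern on bytes, no flags: only ascii word characters can match
bytes_hits = re.findall(br'\w', data)

# decoding the bytestring first lets the unicode tables do the work
decoded_hits = re.findall(r'\w', data.decode('utf8'))
```

only the two str searches find the three letters; the bytes search comes back empty.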

> 2. LOCALE: bytestring with characters in the current locale (but only 1
> byte per character). Characters are categorised according to the
> underlying C library; for example, 'a' is a letter if isalpha('a')
> returns true.

this is actually what my question was about. i suspected something like 
this might be the case, but i can't actually see it stated anywhere in 
the docs. maybe it's just me, but 'current locale' doesn't naturally 
imply 'only 8-bit encodings'. i would have thought it implied 'whatever 
encoding is discovered on the local system' - and these days, that's 
very commonly utf8.
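
a small sketch of why a one-byte-per-character classifier can't cope with utf8 (python 3 syntax; the names are only illustrative). each of the accented letters is a single byte in latin-1 but two bytes in utf8, so a per-byte isalpha() test never sees a whole character:

```python
text = '\xe5 \xe6 \xe7'               # 3 letters + 2 spaces = 5 characters

latin1 = text.encode('latin-1')       # 8-bit encoding
utf8 = text.encode('utf8')            # multi-byte encoding

latin1_len = len(latin1)              # one byte per character
utf8_len = len(utf8)                  # each accented letter takes two bytes

# the single letter U+00E5 becomes the byte pair 0xc3 0xa5 in utf8
utf8_letter = list('\xe5'.encode('utf8'))
```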

is there actually a use case for it working the way it currently does? 
it seems just broken to have it depending so heavily on implementation details.

> 3. UNICODE (default in Python 3): Unicode string.

i've just read the python3 re docs, and they do now make an explicit 
distinction between matching bytes (with the new re.ASCII flag) and 
matching textual characters (i.e. unicode, the default). the re.LOCALE 
flag is still there, and there are now warnings about its unreliability 
- but it still doesn't state that it can only work properly if the local 
encoding is 8-bit.
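
a minimal sketch of how the flags behave with str patterns in python 3. one assumption to flag loudly: the ValueError for re.LOCALE on a str pattern is what recent versions do (3.6 and later, as far as i know); older python 3 releases may only warn or accept it:

```python
import re

text = '\xe5 \xe6 \xe7'

# default for str patterns: \w uses the unicode character tables
unicode_hits = re.findall(r'\w', text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the accented letters are skipped
ascii_hits = re.findall(r'\w', text, re.ASCII)

# assumption: python 3.6+ rejects re.LOCALE combined with a str pattern
try:
    re.compile(r'\w', re.LOCALE)
    locale_on_str_rejected = False
except ValueError:
    locale_on_str_rejected = True
```

so in those versions re.LOCALE is only usable with bytes patterns, which is at least consistent with the one-byte-per-character limitation described above.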
