regular expressions and the LOCALE flag
MRAB
python at mrabarnett.plus.com
Tue Aug 3 16:24:55 EDT 2010
Baz Walter wrote:
> On 03/08/10 19:40, MRAB wrote:
>> Baz Walter wrote:
>>> the python docs say that re.LOCALE makes certain character classes
>>> "dependent on the current locale".
>>
>> re.LOCALE just passes the character to the underlying C library. It
>> really only works on bytestrings which have 1 byte per character.
>
> the re docs don't specify 8-bit encodings: they just refer to the
> 'current locale'.
>
>> And, BTW, none of your examples pass a UTF-8 bytestring to re.findall:
>> all those string literals starting with the 'u' prefix are Unicode
>> strings!
>
> not sure what you mean by this: if the string was encoded as utf8, '\w'
> still wouldn't match any of the non-ascii characters.
>
Strings with the 'u' prefix are Unicode strings, not bytestrings, so
they don't have an encoding. A UTF-8 string is a bytestring in which the
bytes represent Unicode codepoints encoded as UTF-8.
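
To make the distinction concrete, here is a minimal sketch. It assumes
Python 3, where a plain string literal is already a Unicode string and
bytes patterns use ASCII rules by default; the variable names are only
illustrative and don't come from the original examples.

```python
import re

# A Unicode string: four code points, no encoding attached.
text = "caf\u00e9"

# The same text as a UTF-8 bytestring: five bytes, because the
# accented letter is encoded as two bytes (0xC3 0xA9).
data = text.encode("utf-8")

# A str pattern on a Unicode string uses Unicode character classes.
unicode_words = re.findall(r"\w+", text)

# A bytes pattern on the bytestring uses ASCII classes by default, so
# neither of the two bytes of the accented letter counts as \w.
byte_words = re.findall(rb"\w+", data)

print(len(text), len(data))
print(unicode_words, byte_words)
```

The bytestring match stops at 'caf' because the regex engine looks at
one byte at a time and has no idea that the last two bytes belong
together.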
>> Locale encodings are more trouble than they're worth. Unicode is better.
>> :-)
>
> yes, i'm really just trying to decide whether i should offer 'locale' as
> an option in my program. given the unintuitive way re.LOCALE works, i'm
> not sure that i should.
>
> are you saying that it only really makes sense for *bytestrings* to be
> used with re.LOCALE?
>
> if so, the re docs certainly don't make that clear.
The re module can match against 3 types of string:
1. ASCII (default in Python 2): bytestring with characters in the ASCII
range (1 byte per character). It doesn't complain if it sees bytes
outside the ASCII range; they simply don't count as letters, digits or
whitespace.
2. LOCALE: bytestring with characters in the current locale's encoding
(but only 1 byte per character, so multi-byte encodings such as UTF-8
won't work). Characters are categorised according to the underlying C
library; for example, 'a' is a letter if isalpha('a') returns true.
3. UNICODE (default for Unicode strings in Python 3): Unicode string.
Characters are categorised according to the Unicode database.
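
The three cases above can be sketched roughly as follows. This assumes
Python 3.6 or later, where re.LOCALE is only accepted with bytes
patterns; the sample text and variable names are made up for
illustration. What the LOCALE pattern actually matches depends on the
locale in effect, so the sketch only compiles it.

```python
import re

sample = "na\u00efve caf\u00e9"       # Unicode string
encoded = sample.encode("latin-1")    # bytestring, 1 byte per character

# 1. ASCII rules on a bytestring (the default for bytes patterns in
#    Python 3): bytes outside the ASCII range are accepted as input
#    but never match \w.
ascii_matches = re.findall(rb"\w+", encoded)

# 2. LOCALE rules on a bytestring: each byte is classified by the C
#    library for the current locale, so results vary between systems.
locale_pattern = re.compile(rb"\w+", re.LOCALE)

# 3. UNICODE rules on a Unicode string (the default for str patterns).
unicode_matches = re.findall(r"\w+", sample)

# Combining LOCALE with a str pattern is rejected outright.
try:
    re.compile(r"\w+", re.LOCALE)
    locale_on_str_allowed = True
except ValueError:
    locale_on_str_allowed = False

print(ascii_matches)
print(unicode_matches)
print(locale_on_str_allowed)
```

With ASCII rules the accented bytes split the words apart, while the
Unicode match keeps both words whole.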