[Tutor] A regular expression problem

Steven D'Aprano steve at pearwood.info
Wed Dec 1 12:19:48 CET 2010


Josep M. Fontana wrote:
[...]
> I guess this is because the character encoding was not specified but
> accented characters in the languages I'm dealing with should be
> treated as a-z or A-Z, shouldn't they? 

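No. a-z means a-z. If you want the localized set of alphanumeric 
characters, you need \w (which also matches the underscore).

Likewise 0-9 means 0-9. If you want localized digits, you need \d.

Note that in Python 2, \w and \d only go beyond ASCII if you compile the 
pattern with the re.UNICODE (or re.LOCALE) flag. In Python 3, str 
patterns are Unicode-aware by default.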
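
To make the difference concrete, here is a minimal sketch. It assumes 
Python 3 and str patterns; the sample strings are made up for 
illustration.

```python
import re

word = "caf\u00e9"      # "café", with é as a single code point
number = "12\u0663"     # "12" followed by ARABIC-INDIC DIGIT THREE

# A literal range only matches the characters it actually spans.
ascii_letters = re.findall(r"[a-z]+", word)
ascii_digits = re.findall(r"[0-9]+", number)

# \w and \d match by character category, not by ASCII range.
word_chars = re.findall(r"\w+", word)
digit_chars = re.findall(r"\d+", number)

print(ascii_letters, word_chars)
print(ascii_digits, digit_chars)
```

The range-based patterns stop at the first non-ASCII character, while 
the \w and \d versions take the whole string.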


> I mean, how do you deal with
> languages that are not English with regular expressions? I would
> assume that as long as you set the right encoding, Python will be able
> to determine which subset of specific sequences of bytes count as a-z
> or A-Z.

Encodings have nothing to do with this issue.

Literal characters a, b, ..., z etc. always have ONE meaning: they 
represent themselves (although possibly in a case-insensitive fashion). 
E means E, not È, É, Ê or Ë.
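
You can check this for yourself. A small sketch, again assuming 
Python 3, with a made-up sample string: even with re.IGNORECASE, a 
literal E matches only E and e, never the accented forms.

```python
import re

sample = "E e \u00c8 \u00c9 \u00ca \u00cb"   # E e È É Ê Ë

# Case-sensitive: only the plain capital E matches.
exact = re.findall("E", sample)

# Case-insensitive: e matches too, but the accented letters still do not.
either_case = re.findall("E", sample, re.IGNORECASE)

print(exact, either_case)
```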

Localization tells the regex how to interpret special patterns like \d 
and \w. This has nothing to do with encodings -- by the time the regex 
sees the string, it is already dealing with characters. Localization is 
about which characters belong to which categories ("is 5 a digit or a 
letter? how about ٣ ?").
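
If you are curious how Python itself answers that question, the 
standard unicodedata module will tell you. This is only a sketch of one 
way to look the categories up (assuming Python 3), not a description of 
what the regex engine does internally.

```python
import unicodedata

chars = ("5", "\u0663", "a")   # 5, ٣ (ARABIC-INDIC DIGIT THREE), a

# Map each character to its Unicode general category:
# "Nd" is "Number, decimal digit", "Ll" is "Letter, lowercase".
categories = {ch: unicodedata.category(ch) for ch in chars}

for ch in chars:
    print(repr(ch), unicodedata.name(ch), categories[ch])

# Decimal digits from other scripts convert to numbers as well.
value = int("\u0663")
print(value)
```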

Encoding is used to translate between bytes on disk and characters. For 
example, the character Ë could be stored on disk as the hex bytes:

\xcb              # one byte (Latin-1)
\xc3\x8b          # two bytes (UTF-8)
\xff\xfe\xcb\x00  # four bytes (UTF-16, little-endian, with a BOM)

and more, depending on the encoding used.
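
Here is a short sketch, assuming Python 3, that produces those three 
byte sequences from the same character. The UTF-16 line spells out the 
byte order and adds the BOM by hand, since the plain "utf-16" codec 
uses the machine's native byte order when encoding.

```python
import codecs

ch = "\u00cb"  # Ë

latin1 = ch.encode("latin-1")
utf8 = ch.encode("utf-8")
utf16 = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")

print(latin1, utf8, utf16)

# Decoding with the matching codec gives back the same single character.
decoded = [
    latin1.decode("latin-1"),
    utf8.decode("utf-8"),
    utf16.decode("utf-16"),   # the BOM tells the decoder the byte order
]
print(decoded)
```

Three different byte strings, one character. The regex only ever sees 
the character.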


-- 
Steven

