[Tutor] A regular expression problem
Steven D'Aprano
steve at pearwood.info
Wed Dec 1 12:19:48 CET 2010
Josep M. Fontana wrote:
[...]
> I guess this is because the character encoding was not specified but
> accented characters in the languages I'm dealing with should be
> treated as a-z or A-Z, shouldn't they?
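No. a-z means a-z. If you want the localized set of alphanumeric
characters, you need \w (in Python 2, combined with the re.LOCALE or
re.UNICODE flag; note that \w also matches the underscore).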
Likewise 0-9 means 0-9. If you want localized digits, you need \d.
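To make the difference concrete, here is a small sketch. It assumes
Python 3, where str patterns are Unicode-aware by default; the sample
text and variable names are made up for illustration.

```python
import re

# Sample text (invented for this sketch): "café Ëric 123"
text = "caf\u00e9 \u00cbric 123"

# [a-zA-Z] matches only the 52 unaccented ASCII letters, so the
# accented characters split or truncate the words.
ascii_only = re.findall(r"[a-zA-Z]+", text)

# \w matches any character Python counts as a "word" character,
# including accented letters, digits and the underscore.
word_chars = re.findall(r"\w+", text)

print(ascii_only)
print(word_chars)
```

The first list loses the accented letters, while the second keeps each
word whole.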
> I mean, how do you deal with
> languages that are not English with regular expressions? I would
> assume that as long as you set the right encoding, Python will be able
> to determine which subset of specific sequences of bytes count as a-z
> or A-Z.
Encodings have nothing to do with this issue.
Literal characters a, b, ..., z etc. always have ONE meaning: they
represent themselves (although possibly in a case-insensitive fashion).
E means E, not È, É, Ê or Ë.
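A quick check of this, again assuming Python 3 and using names chosen
only for this sketch:

```python
import re

# A literal E in the pattern does not match an accented E.
plain_vs_accented = re.fullmatch("E", "\u00c9")      # target is É

# With IGNORECASE, a literal still matches only its own upper/lower forms.
same_letter_other_case = re.fullmatch("e", "E", re.IGNORECASE)
accented_other_case = re.fullmatch("\u00e9", "\u00c9", re.IGNORECASE)  # é vs É
plain_vs_accented_ci = re.fullmatch("e", "\u00c9", re.IGNORECASE)      # e vs É

print(plain_vs_accented, plain_vs_accented_ci)
print(same_letter_other_case, accented_other_case)
```

Case-insensitivity pairs e with E and é with É, but it never turns an
unaccented letter into an accented one.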
Localization tells the regex how to interpret special patterns like \d
and \w. This has nothing to do with encodings -- by the time the regex
sees the string, it is already dealing with characters. Localization is
about what characters are in categories ("is 5 a digit or a letter? how
about ٣ ?").
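One way to look at those categories from Python is the unicodedata
module. This is a sketch assuming Python 3, where the categories come
from the Unicode character database rather than from the system locale;
the variable names are illustrative only.

```python
import re
import unicodedata

five = "5"
arabic_three = "\u0663"   # ٣, ARABIC-INDIC DIGIT THREE
e_diaeresis = "\u00cb"    # Ë

# General category of each character: 'Nd' is a decimal digit,
# 'Lu' is an uppercase letter.
categories = {ch: unicodedata.category(ch)
              for ch in (five, arabic_three, e_diaeresis)}

# \d follows the category, [0-9] follows the literal range.
digits_d = re.findall(r"\d", five + arabic_three)
digits_range = re.findall(r"[0-9]", five + arabic_three)

print(categories)
print(digits_d, digits_range)
```

Both 5 and ٣ are classified as decimal digits, so \d matches both, while
[0-9] matches only the first.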
Encoding is used to translate between bytes on disk and characters. For
example, the character Ë could be stored on disk as the hex bytes:
    \xcb                # one byte (Latin-1)
    \xc3\x8b            # two bytes (UTF-8)
    \xff\xfe\xcb\x00    # four bytes (UTF-16 with a byte-order mark)
and more, depending on the encoding used.
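The byte sequences above can be reproduced with str.encode. This is a
sketch assuming Python 3; the little-endian byte-order mark is written
out explicitly so the result does not depend on the machine it runs on.

```python
import codecs

ch = "\u00cb"  # Ë

latin1 = ch.encode("latin-1")
utf8 = ch.encode("utf-8")
# UTF-16, little-endian, preceded by its byte-order mark.
utf16 = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")

for name, data in (("latin-1", latin1), ("utf-8", utf8), ("utf-16", utf16)):
    print(name, data, len(data), "byte(s)")

# Decoding with the matching encoding gives back the same character.
round_trip = (latin1.decode("latin-1"), utf8.decode("utf-8"),
              utf16.decode("utf-16"))
```

All three byte strings decode to the same single character, which is all
the regex ever sees.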
--
Steven