[Tutor] Problems processing accented characters in ISO-8859-1 encoded texts
Steven D'Aprano
steve at pearwood.info
Thu Dec 23 11:51:42 CET 2010
Josep M. Fontana wrote:
> I am working with texts that are encoded as ISO-8859-1. I have
> included the following two lines at the beginning of my python script:
>
> #!/usr/bin/env python
> # -*- coding: iso-8859-1 -*-
>
> If I'm not mistaken, this should tell Python that accented characters
> such as 'á', 'Á', 'ö' or 'è' should be considered as alpha-numeric
> characters and therefore matched with a regular expression of the form
> [a-zA-Z].
You are mistaken. [a-zA-Z] always means the ASCII letters a to z and A to Z, and nothing else.
You are conflating three unrelated problems:
(1) What encoding is used to convert the bytes of the source code on
disk, including its string literals, into characters?
(2) What encoding is used for the data fed to the regex engine?
(3) What characters does the regex engine consider to be alphanumeric?
The encoding line only tells Python what encoding to use to read the
source code. It has no effect on text read from files, or byte strings,
or anything else. It is only to allow literals and identifiers to be
decoded correctly, and has nothing to do with regular expressions.
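To make that concrete, here is a minimal sketch of decoding the data yourself. It is written for Python 3 rather than the Python 2 used in the session further down, and the byte string and variable names are made up for illustration; they stand in for whatever you read from your own ISO-8859-1 files.

```python
# Bytes as they might come from a file opened in binary mode
# (hypothetical sample data, not taken from any real file).
raw = b"ab\xf6yz"

# The coding line at the top of a script plays no part here: the data
# has to be decoded explicitly, using the encoding it was written in.
decoded = raw.decode("iso-8859-1")

print(decoded == "ab\u00f6yz")  # True: byte 0xF6 is the character ö in ISO-8859-1
```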
To match accented characters, you can do two things:
(1) explicitly include the accented characters you care about in
the regular expression;
or
(2) i. set the current LOCALE to a locale that includes the
characters you care about;
ii. search for the \w regex special sequence; and
iii. include the (?L) flag (the inline form of re.LOCALE) in the regex.
In both cases, don't forget to use Unicode strings, not byte strings.
For example:
>>> import re, locale
>>> text = u"...aböyz..."
>>> re.search(r'[a-zA-Z]+', text).group(0)
u'ab'
Setting the locale on its own isn't enough:
>>> locale.setlocale(locale.LC_ALL, 'de_DE')
'de_DE'
>>> re.search(r'[a-zA-Z]+', text).group(0)
u'ab'
Nor is using the locale-aware alphanumeric sequence, since the regex
engine is still using the default C locale:
>>> re.search(r'\w+', text).group(0)
u'ab'
But if you instruct the engine to use the current locale instead, then
it works:
>>> re.search(r'(?L)\w+', text).group(0)
u'ab\xf6yz'
(Don't be put off by the ugly printing representation of the unicode
string. \xf6 is just the repr() of the character ö.)
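If you want to convince yourself that the escape and the character are the same thing, a small sketch like this may help. It assumes Python 3, where ascii() gives the escaped form that repr() gave for unicode strings in Python 2; the variable names are only for illustration.

```python
word = "ab\u00f6yz"

# The escape \xf6 and the character ö are the same single character.
same_char = ("\xf6" == "\u00f6")

# ascii() returns the escaped, printable form, including the quotes.
escaped = ascii(word)

print(same_char)  # True
print(escaped)    # 'ab\xf6yz'
```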
Oh, and just to prove my point that a-z is always ASCII, even with the
locale set:
>>> re.search(r'(?L)[a-zA-Z]+', text).group(0)
u'ab'
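For comparison, option (1) needs no locale at all: you list the accented characters in the character class yourself. The following is a sketch only, assuming Python 3 (where ordinary strings are already Unicode) and a hypothetical class containing just the four accented letters from the original question; a real script would list every letter it expects to meet.

```python
import re

text = "...ab\u00f6yz..."  # same sample text as above; \u00f6 is ö

# Hypothetical character class: the ASCII letters plus á, Á, ö and è,
# written as escapes so the example does not depend on the file's encoding.
letters = re.compile("[a-zA-Z\u00e1\u00c1\u00f6\u00e8]+")

found = letters.search(text).group(0)
print(found == "ab\u00f6yz")  # True
```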
Note also that \w means alphanumeric, not just alpha, so it will also
match digits.
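If digits are unwanted, one common workaround is to take "word characters that are neither digits nor underscore". The sketch below assumes Python 3, where \w on ordinary strings is Unicode-aware without any flag; in Python 2 the same pattern would need a unicode string and the re.UNICODE flag (or the locale set-up described above). The sample string is invented for illustration.

```python
import re

sample = "ab\u00f6 123 y_z"

# [^\W\d_] reads as: not a non-word character, not a digit, not an
# underscore. What is left is letters only, accented ones included.
letters_only = re.findall(r"[^\W\d_]+", sample)

print(letters_only == ["ab\u00f6", "y", "z"])  # True
```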
--
Steven