[Tutor] Problems processing accented characters in ISO-8859-1 encoded texts

Thu Dec 23 11:51:42 CET 2010

Josep M. Fontana wrote:
> I am working with texts that are encoded as ISO-8859-1. I have
> included the following two lines at the beginning of my python script:
> 
> #!/usr/bin/env python
> # -*- coding: iso-8859-1 -*-
> 
> If I'm not mistaken, this should tell Python that accented characters
> such as 'á', 'Á', 'ö' or 'è' should be considered as alpha-numeric
> characters and therefore matched with a regular expression of the form
> [a-zA-Z].

You are mistaken. a-zA-Z always means the ASCII A to Z, and nothing else.

You are conflating three separate questions:

(1) What encoding is used to convert the bytes on disk of the source 
code literals into characters?

(2) What encoding is used for the data fed to the regex engine?

(3) What characters does the regex engine consider to be alphanumeric?

The encoding line only tells Python what encoding to use to read the 
source code. It has no effect on text read from files, or byte strings, 
or anything else. It is only to allow literals and identifiers to be 
decoded correctly, and has nothing to do with regular expressions.
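To make the distinction concrete, here is a minimal sketch of decoding file data explicitly, independent of any coding line in the script. It assumes Python 3 (in Python 2 you would use io.open or codecs.open with the same encoding argument), and the sample bytes and temporary file are made up for illustration:

```python
import os
import tempfile

# Hypothetical sample data: "aböyz" encoded as ISO-8859-1,
# where ö is the single byte 0xF6.
raw = b"ab\xf6yz"

# Write the bytes to a temporary file standing in for one of your texts.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode: you get the bytes back untouched, no decoding happens.
    with open(path, "rb") as f:
        data = f.read()

    # Text mode with an explicit encoding: the bytes are decoded to characters.
    # The source file's coding line plays no part in this step.
    with open(path, encoding="iso-8859-1") as f:
        text = f.read()
finally:
    os.remove(path)

print(repr(data))
print(ascii(text))
```

The point is only that the encoding of the data is something you state when you read the data, not something the script header decides.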

To match accented characters, you can do two things:

(1) explicitly include the accented characters you care about in
     the regular expression;

or

(2) i.   set the current LOCALE to a locale that includes the
          characters you care about;
     ii.  search for the \w regex special sequence; and
     iii. include the (?L) flag (re.LOCALE) in the regex.

In both cases, don't forget to use Unicode strings, not byte strings.
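Option (1) might look something like the sketch below. It is written for Python 3, where ordinary string literals are already Unicode, and the particular set of accented letters is just the ones from your message. The second pattern, which covers the accented letters of ISO-8859-1 by code point range, is my own addition rather than anything the re module provides ready-made:

```python
import re

# Same sample as in the examples, with ö written as an escape so the
# snippet does not depend on how this file itself is encoded.
text = "...ab\u00f6yz..."

# Option (1): list the accented letters you care about explicitly
# (here: á, Á, ö, è).
explicit = re.search("[a-zA-Z\u00e1\u00c1\u00f6\u00e8]+", text).group(0)

# Variant: cover every letter in the Latin-1 range U+00C0..U+00FF,
# skipping the two non-letters there (U+00D7 and U+00F7).
latin1_letters = re.compile("[a-zA-Z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]+")
ranged = latin1_letters.search(text).group(0)

print(ascii(explicit))
print(ascii(ranged))
```

The explicit list is easier to read; the range is less likely to miss a character you forgot about.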

For example:

 >>> import locale, re
 >>> text = u"...aböyz..."
 >>> re.search(r'[a-zA-Z]+', text).group(0)
u'ab'

Setting the locale on its own isn't enough:

 >>> locale.setlocale(locale.LC_ALL, 'de_DE')
'de_DE'
 >>> re.search(r'[a-zA-Z]+', text).group(0)
u'ab'

Nor is switching to the locale-aware alphanumeric sequence \w on its 
own, since the regex engine is still using the default C locale:

 >>> re.search(r'\w+', text).group(0)
u'ab'

But if you instruct the engine to use the current locale instead, then 
it works:

 >>> re.search(r'(?L)\w+', text).group(0)
u'ab\xf6yz'

(Don't be put off by the ugly printing representation of the unicode 
string. \xf6 is just the repr() of the character ö.)
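If you want to convince yourself that the escape and the character are the same thing, a small check along these lines may help. It assumes Python 3, where repr() no longer escapes non-ASCII characters, so ascii() is used to get the escaped form (without the u prefix shown above):

```python
import unicodedata

s = "ab\u00f6yz"

# ascii() gives the escaped form, comparable to the Python 2 repr above.
escaped = ascii(s)

# The third character is a single code point, whatever it looks like printed.
name = unicodedata.name(s[2])

print(escaped)
print(len(s), name)
```

However it is displayed, the string holds five characters, and the third one is ö.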

Oh, and just to prove my point that a-z is always ASCII, even with the 
locale set:

 >>> re.search(r'(?L)[a-zA-Z]+', text).group(0)
u'ab'

Note also that \w means alphanumeric plus the underscore, not just 
alpha, so it will also match digits and "_".
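If you only want letters, one common idiom is to take \w and subtract the digits and the underscore with a negated class. The sketch below assumes Python 3, where \w on ordinary (Unicode) strings already matches accented letters without any locale setup, and where re.LOCALE is only accepted for bytes patterns; the sample text with the trailing "42_" is made up for illustration:

```python
import re

# Sample with letters, digits and an underscore after the accented word.
text = "...ab\u00f6yz42_..."

# \w on a str pattern is Unicode-aware in Python 3: letters, digits, "_".
word_chars = re.search(r"\w+", text).group(0)

# "Not a non-word character, not a digit, not an underscore" = letters only.
letters_only = re.search(r"[^\W\d_]+", text).group(0)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the match stops at the ö.
ascii_only = re.search(r"\w+", text, re.ASCII).group(0)

print(ascii(word_chars))
print(ascii(letters_only))
print(ascii(ascii_only))
```

The [^\W\d_] trick works the same way with (?L) or re.UNICODE in Python 2, since it is built from \w itself.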

-- 
Steven