python encoding bug?

Benjamin Niemann pink at odahoda.de
Sat Dec 31 08:03:25 EST 2005


garabik-news-2005-05 at kassiopeia.juls.savba.sk wrote:

> 
> I was playing with python encodings and noticed this:
> 
> garabik at lancre:~$ python2.4
> Python 2.4 (#2, Dec  3 2004, 17:59:05)
> [GCC 3.3.5 (Debian 1:3.3.5-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> unicode('\x9d', 'iso8859_1')
> u'\x9d'
>>>>
> 
> U+009D is NOT a valid unicode character (it is not even a valid
> iso8859_1 character)

It *IS* a valid Unicode and iso8859-1 character, so the behaviour of the
python decoder is correct. The range U+0080 - U+009F is used for the C1
control characters. There's rarely a valid use for these characters in
documents, so a document containing them is almost always really
windows-1252 - such bytes decode fine as iso-8859-1, but for a heuristic
guess it's probably safer to assume windows-1252.
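A quick illustration of the point (Python 3 syntax here, the session above is
Python 2.4): a byte like 0x93 decodes without complaint as iso-8859-1, giving
a C1 control character, while decoding it as cp1252 gives the curly quote the
author of the document almost certainly meant.

data = b'\x93Hello\x94'   # curly quotes as written by a Windows editor

print(repr(data.decode('iso8859-1')))  # '\x93Hello\x94' - C1 controls, technically valid
print(repr(data.decode('cp1252')))     # '\u201cHello\u201d' - the intended text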

If you want an exception to be thrown, you'll need to implement your own
codec - something like 'iso8859_1_nocc'. Hmm, I might try this myself,
because I do such a test in one of my projects, too ;)
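
In case you want to try it, here's a rough sketch of such a codec in current
Python (the session above is 2.4, so take the exact registration API with a
grain of salt; the helper names and the codec name 'iso8859_1_nocc' are just
the ones suggested here):

import codecs

def _decode_nocc(data, errors='strict'):
    # decode as plain latin-1 first, then reject the C1 control range
    text, consumed = codecs.latin_1_decode(data, errors)
    for pos, ch in enumerate(text):
        if '\x80' <= ch <= '\x9f':
            raise UnicodeDecodeError('iso8859_1_nocc', bytes(data), pos, pos + 1,
                                     'C1 control character not allowed')
    return text, consumed

def _search(name):
    if name == 'iso8859_1_nocc':
        return codecs.CodecInfo(codecs.latin_1_encode, _decode_nocc,
                                name='iso8859_1_nocc')
    return None

codecs.register(_search)

b'Hello'.decode('iso8859_1_nocc')   # fine
b'\x9d'.decode('iso8859_1_nocc')    # raises UnicodeDecodeError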

> The same happens if I use 'latin-1' instead of 'iso8859_1'.
> 
> This caught me by surprise, since I was doing some heuristics guessing
> string encodings, and 'iso8859_1' gave no errors even if the input
> encoding was different.
> 
> Is this a known behaviour, or I discovered a terrible unknown bug in
> python encoding implementation that should be immediately reported and
> fixed? :-)
> 
> 
> happy new year,
> 

-- 
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/
