comparing Unicode and string

Neil Cerutti horpner at yahoo.com
Mon Oct 16 10:12:51 EDT 2006


On 2006-10-16, luc.saffre at gmail.com <luc.saffre at gmail.com> wrote:
> Hello,
>
> here is something that surprises me.
>
>   #coding: iso-8859-1

The conventional spelling is the one below, though I believe your
shorter form is recognized as well:

# -*- coding: iso-8859-1 -*-

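The special comment only tells Python how to decode unicode
literals in the source file. In particular, it doesn't change the
default encoding (ascii) used when a str is implicitly converted
to unicode; str literals simply keep the bytes as they appear in
the file.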

>   s1=u"Frau Müller machte große Augen"
>   s2="Frau Müller machte große Augen"
>   if s1 == s2:
>       pass

On my machine, the ü and ß in s2 are being stored as the byte
values of my terminal's encoding, cp437. Unfortunately, cp437
byte values from 128-255 do not map to the same characters as
those in iso-8859-1.

To fix this, I have to do the following:

>>> s1 == s2.decode('cp437')
True
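
As a rough illustration of why the codec matters, here is a small
sketch. It is written for Python 3, where bytes plays the role
that str plays in the session above and str plays the role of
unicode; the variable names are my own, and the text is spelled
with escapes so the example doesn't depend on the file's encoding.

```python
# u-umlaut is U+00FC and sharp s is U+00DF.
text = "Frau M\u00fcller machte gro\u00dfe Augen"

# The same characters become different byte values in each codec.
as_cp437 = text.encode("cp437")
as_latin1 = text.encode("iso-8859-1")

# Index 6 is the u-umlaut: 0x81 in cp437, 0xfc in iso-8859-1.
print(hex(as_cp437[6]), hex(as_latin1[6]))

# Decoding with the codec that produced the bytes restores the text.
round_trip_ok = as_cp437.decode("cp437") == text

# Decoding with the wrong codec succeeds but yields different text.
wrong_codec_differs = as_cp437.decode("iso-8859-1") != text

print(round_trip_ok, wrong_codec_differs)
```

The point is only that the bytes alone don't say which encoding
they are in; the decode call has to name the right one.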

> Running this code produces a UnicodeDecodeError:
>
> Traceback (most recent call last):
>   File "tmp.py", line 4, in ?
>     if s1 == s2:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6:
> ordinal not in range(128)
>
> I would have expected that "s1 == s2" gives True... or maybe
> False... but raising an error here is unnecessary. I guess that
> the comparison operator decides to convert s2 to a Unicode but
> forgets that I said #coding: iso-8859-1 at the beginning of the
> file.

It's trying to interpret s2 as ascii, and failing, since the
bytes for ü and ß (129 and 225 in cp437 on my machine, 0xfc and
0xdf in your iso-8859-1 file) are outside the ascii range 0-127.
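
The failure can be reproduced in isolation with a sketch like the
one below. It assumes Python 3, which never decodes implicitly
during a comparison, so the decode that the == operator performs
behind the scenes in the traceback above is written out
explicitly here. The names raw, message, position and decoded are
mine.

```python
# The iso-8859-1 bytes of "Frau Müller": b'Frau M\xfcller'
raw = "Frau M\u00fcller".encode("iso-8859-1")

message = None
position = None
try:
    # ascii only covers 0-127, so byte 0xfc cannot be decoded.
    raw.decode("ascii")
except UnicodeDecodeError as err:
    message = str(err)
    position = err.start

print(message)

# Naming the real encoding makes the conversion succeed.
decoded = raw.decode("iso-8859-1")
```

Decoding explicitly with the encoding the bytes were written in,
rather than relying on the ascii default, avoids the error.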

-- 
Neil Cerutti


