comparing Unicode and string
Neil Cerutti
horpner at yahoo.com
Mon Oct 16 10:12:51 EDT 2006
On 2006-10-16, luc.saffre at gmail.com <luc.saffre at gmail.com> wrote:
> Hello,
>
> here is something that surprises me.
>
> #coding: iso-8859-1
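The more common spelling is:
# -*- coding: iso-8859-1 -*-
though as far as I can tell PEP 263 only requires a comment
matching coding[:=]\s*([-\w.]+) on the first or second line, so
your version should be accepted too.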
The special comment changes only the encoding of unicode
literals. In particular, it doesn't change the default encoding
of str literals.
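
A small sketch of the declaration at work, not a definitive test.
The source text, the "<demo>" filename and the variable names are
made up for illustration; the b"" prefix needs Python 2.6 or
later. It compiles the same bytes under both spellings of the
comment and checks that the unicode literal comes out as
iso-8859-1 text:

```python
# Source code as raw bytes: byte 0xfc is "ü" in iso-8859-1.
literal = b"s = u'M\xfcller'\n"

results = []
for cookie in (b"# -*- coding: iso-8859-1 -*-\n", b"#coding: iso-8859-1\n"):
    namespace = {}
    # compile() reads the coding comment and decodes the literal with it.
    exec(compile(cookie + literal, "<demo>", "exec"), namespace)
    results.append(namespace["s"])

# Both spellings yield the same unicode string.
assert results == [u"M\xfcller", u"M\xfcller"]
```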
> s1=u"Frau Müller machte große Augen"
> s2="Frau Müller machte große Augen"
> if s1 == s2:
>     pass
On my machine, the ü and ß in s2 are stored as bytes in my
terminal's encoding, cp437. Unfortunately, the cp437 code points
from 128-255 are not the same as those in iso-8859-1.
To fix this, I have to do the following:
>>> s1 == s2.decode('cp437')
True
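
As a rough sketch of the difference: the byte strings below are
my own spelled-out assumptions about how each encoding stores the
sentence, written with escapes so the snippet doesn't depend on
any source or terminal encoding (the b"" prefix needs Python 2.6
or later):

```python
# Same sentence as s1, with the non-ASCII characters as escapes.
s1 = u"Frau M\xfcller machte gro\xdfe Augen"

# Assumed contents of s2 when typed at a cp437 terminal versus
# when read from an iso-8859-1 source file.
s2_cp437 = b"Frau M\x81ller machte gro\xe1e Augen"   # ü = 129, ß = 225
s2_latin1 = b"Frau M\xfcller machte gro\xdfe Augen"  # ü = 252, ß = 223

# Decoding with the matching codec gives the same text either way.
assert s2_cp437.decode("cp437") == s1
assert s2_latin1.decode("iso-8859-1") == s1

# Decoding with the wrong codec succeeds silently but gives other text.
assert s2_cp437.decode("iso-8859-1") != s1
```

So for a file that really is saved as iso-8859-1, the comparison
would presumably be s1 == s2.decode('iso-8859-1').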
> Running this code produces a UnicodeDecodeError:
>
> Traceback (most recent call last):
>   File "tmp.py", line 4, in ?
>     if s1 == s2:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6:
> ordinal not in range(128)
>
> I would have expected that "s1 == s2" gives True... or maybe
> False... but raising an error here is unnecessary. I guess that
> the comparison operator decides to convert s2 to a Unicode but
> forgets that I said #coding: iso-8859-1 at the beginning of the
> file.
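It's trying to decode s2 as ascii, and failing, since anything
above 127 is out of range. On my machine the offending code
points are 129 and 225 (ü and ß in cp437); in your iso-8859-1
file the first one is 0xfc (252), as the traceback says.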
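
A minimal sketch that reproduces the failure by hand, assuming s2
holds the iso-8859-1 bytes (the names failed_at and bad_byte are
just for illustration):

```python
# s2 as it would be stored from an iso-8859-1 source file (assumed).
s2 = b"Frau M\xfcller machte gro\xdfe Augen"

try:
    # Decode explicitly with the ascii codec, as the comparison
    # does implicitly.
    s2.decode("ascii")
except UnicodeDecodeError as exc:
    failed_at = exc.start                     # index of the first bad byte
    bad_byte = exc.object[exc.start:exc.end]  # the byte itself
else:
    failed_at = bad_byte = None

# Matches the traceback: byte 0xfc in position 6.
assert failed_at == 6
assert bad_byte == b"\xfc"
```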
--
Neil Cerutti