unicode woes
Martin v. Loewis
martin at v.loewis.de
Sun Oct 6 03:41:55 EDT 2002
"Matt Gerrans" <mgerrans at mindspring.com> writes:
> > - Never mix byte strings and Unicode strings (unless the byte strings
> > are restricted to bytes <127, perhaps).
>
> Could you elaborate a bit more on this point? I thought this was okay to
> do, since the byte string will be promoted to unicode. For instance:
>
> >>> 'abcd' + u'zyxw'
> u'abcdzyxw'
>
> Is this an acceptable thing to do?
Yes, this is a case where all bytes in the byte string are below 128,
and thus can reasonably be considered to be ASCII.
>>> s="Hallöchen" + u'zyxw'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 4: ordinal not in range(128)
This is a case where promotion to Unicode doesn't work. People are
tempted to solve this by setting the default encoding, but I advise
against this.
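A minimal sketch of the alternative to changing the default encoding:
decode the byte string explicitly before combining it with a Unicode
string. It assumes the bytes are known to be ISO-8859-1; the names raw,
text and combined are only illustrative. The b"" and u"" prefixes are
used so the snippet should behave the same on newer Python 2 releases
and on Python 3.

```python
# Assumption: the byte string is known to be ISO-8859-1 (Latin-1) encoded.
raw = b"Hall\xf6chen"

# Decoding as ASCII fails, just like the implicit promotion above.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the real encoding gives a Unicode string that can be
# concatenated safely with other Unicode strings.
text = raw.decode("iso-8859-1")
combined = text + u"zyxw"
```

The point of the explicit decode is that the encoding is stated at the
place where the data enters the program, rather than guessed globally.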
Even for ASCII strings, promotion is best restricted to string
literals. Data coming from a file should be explicitly converted to
Unicode in a Unicode application, even if it happens not to contain any
bytes >127. Some encodings (e.g. iso-2022-jp) use only bytes below 128,
yet are not ASCII. If you have such data, the default encoding would
decode it without raising an error, but would produce the wrong
characters.
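The iso-2022-jp case can be illustrated with a short sketch. The sample
text (three Japanese characters) and the variable names are assumptions
chosen for the example, not taken from any real data.

```python
# Hypothetical sample: the Japanese word for "Japanese language".
text = u"\u65e5\u672c\u8a9e"
encoded = text.encode("iso-2022-jp")

# Every byte is below 128 (escape sequences plus 7-bit characters).
# bytearray() yields integers on both Python 2 and Python 3.
all_seven_bit = all(b < 128 for b in bytearray(encoded))

# Decoding as ASCII raises no error, but the result is not the
# original text: it contains the raw escape sequences instead.
wrong = encoded.decode("ascii")

# Decoding with the correct codec restores the original text.
right = encoded.decode("iso-2022-jp")
```

Here the ASCII decode succeeds silently, which is why such data has to
be decoded with its actual encoding even though no byte exceeds 127.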
Regards,
Martin