unicode woes

Sun Oct 6 03:41:55 EDT 2002

"Matt Gerrans" <mgerrans at mindspring.com> writes:

> > - Never mix byte strings and Unicode strings (unless the byte strings
> >   are restricted to bytes <128, perhaps).
> 
> Could you elaborate a bit more on this point?   I thought this was okay to
> do, since the byte string will be promoted to unicode.   For instance:
> 
> >>> 'abcd' + u'zyxw'
> u'abcdzyxw'
> 
> Is this an acceptable thing to do?

Yes, this is a case where all bytes in the byte string are below 128,
and thus can reasonably be considered to be ASCII.

>>> s="Hallöchen" + u'zyxw'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 4: ordinal not in range(128)

This is a case where promotion to Unicode doesn't work. People are
tempted to solve this by setting the default encoding, but I advise
against this.
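
As a rough sketch of the alternative, decode the byte string explicitly
with the encoding you know it to be in, instead of relying on the default
encoding. This assumes a Python 3 interpreter, where byte strings and
Unicode strings are never mixed implicitly, and it assumes the bytes came
from a Latin-1 terminal; the variable names are only for illustration.

```python
# Assumption: the bytes are what a Latin-1 terminal would have produced
# for "Hallöchen"; byte 0xf6 (o-umlaut) sits at position 4.
raw = "Hall\u00f6chen".encode("latin-1")

# Decoding as ASCII fails, which is what implicit promotion ran into.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the encoding the data is actually in works, and the
# result can be combined with other Unicode strings safely.
text = raw.decode("latin-1") + "zyxw"
```

The point of the design is that the encoding is named at the place where
the data enters the program, not guessed globally.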

Even for ASCII strings, promotion is best restricted to string
literals. In a Unicode application, data coming from a file should be
converted to Unicode explicitly, even if it happens not to use any
bytes >127. Some encodings (e.g. iso-2022-jp) use only bytes below 128,
yet are not ASCII. If you have such data, even the default encoding
would convert the data incorrectly.
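
A small illustration of the iso-2022-jp case, again as a sketch that
assumes a Python 3 interpreter; the sample text (three Japanese
characters) and the variable names are my own choice for the example.

```python
# Sample text chosen for illustration: the word for "Japanese language".
japanese = "\u65e5\u672c\u8a9e"

# iso-2022-jp switches character sets with escape sequences, so every
# byte of the encoded form is below 128.
raw = japanese.encode("iso-2022-jp")
all_below_128 = all(b < 128 for b in raw)

# Treating the bytes as ASCII raises no error, but the result is the raw
# escape sequences, not the intended text.
wrong = raw.decode("ascii")

# Only decoding with the real encoding recovers the text.
right = raw.decode("iso-2022-jp")
```

So the absence of an exception says nothing about whether the
conversion was correct.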

Regards,
Martin