[I18n-sig] Re: Pre-PEP: Python Character Model

Fredrik Lundh fredrik@effbot.org
Sun, 11 Feb 2001 11:14:25 +0100


(trying to catch up from the archives; just realized
that I wasn't subscribed to i18n)

> > I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data
> > in a string literal. PythonWin and Tk expect Unicode. How could they
> > display the characters correctly?
>
> No, PythonWin and Tk both tell apart Unicode and byte strings
> (although Tk uses quite a funny algorithm to do so). If they see a
> byte string, they convert it using the platform encoding (which is
> user-settable on both Windows and Unix) to a Unicode string, and
> display that.

Not quite true for Tk: Tcl's 8-bit to Unicode conversion expects
UTF-8.  When it sees a lead byte without enough trail bytes, the
lead byte is copied as is.  Naked trail bytes are also copied as is.

Under Latin-1, the following three Python strings all result in the
same Tcl string value:

    str = "åäö"
    str = u"åäö".encode("utf-8")
    str = u"åäö"

But under a hypothetical platform encoding where "å" looks like
a UTF-8 lead byte, and "ä" like a trail byte, this will fail (if you
think that's unlikely, feel free to replace "å" and "ä" with other
characters...).
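
To make the behaviour concrete, here is a rough sketch of the lenient
conversion described above.  It is not Tcl's actual code: the function
name tcl_like_decode is made up for illustration, it only handles
2- and 3-byte sequences, it does not reject overlong forms, and it is
written for a current Python 3 (where bytes and str are distinct types)
rather than the Python of this thread.

```python
def tcl_like_decode(data: bytes) -> str:
    """Decode bytes as UTF-8, copying malformed bytes through as is.

    Hypothetical helper: a simplified emulation of the behaviour
    described in the post, not Tcl's real implementation.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # How many trail bytes does this byte announce?
        if b < 0x80:
            need = 0            # plain ASCII
        elif 0xC0 <= b < 0xE0:
            need = 1            # lead byte of a 2-byte sequence
        elif 0xE0 <= b < 0xF0:
            need = 2            # lead byte of a 3-byte sequence
        else:
            need = -1           # naked trail byte, or a lead we ignore here
        trail = data[i + 1:i + 1 + need] if need > 0 else b""
        complete = (need > 0 and len(trail) == need
                    and all(0x80 <= t < 0xC0 for t in trail))
        if complete:
            # Well-formed sequence: assemble the code point.
            cp = b & (0x1F if need == 1 else 0x0F)
            for t in trail:
                cp = (cp << 6) | (t & 0x3F)
            out.append(chr(cp))
            i += 1 + need
        else:
            # ASCII, short lead, or naked trail: copy the byte as is.
            out.append(chr(b))
            i += 1
    return "".join(out)


latin1_bytes = "åäö".encode("latin-1")   # b'\xe5\xe4\xf6'
utf8_bytes = "åäö".encode("utf-8")       # b'\xc3\xa5\xc3\xa4\xc3\xb6'

# Both byte strings come out as the same text, as in the Latin-1 case.
same = tcl_like_decode(latin1_bytes) == tcl_like_decode(utf8_bytes) == "åäö"

# Failure case: the Latin-1 text "Ã¥" is the bytes C3 A5, which happen
# to form a valid UTF-8 sequence, so the two characters collapse into one.
collapsed = tcl_like_decode("Ã¥".encode("latin-1"))
```

With this sketch, same is True while collapsed is the single character
"å" instead of the two characters that were passed in, which is the
kind of silent corruption the hypothetical encoding would trigger.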

Cheers /F