[I18n-sig] Re: Pre-PEP: Python Character Model
Fredrik Lundh
fredrik@effbot.org
Sun, 11 Feb 2001 11:14:25 +0100
(trying to catch up from the archives; just realized
that I wasn't subscribed to i18n)
> > I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data
> > in a string literal. PythonWin and Tk expect Unicode. How could they
> > display the characters correctly?
>
> No, PythonWin and Tk both tell apart Unicode and byte strings
> (although Tk uses quite a funny algorithm to do so). If they see a
> byte string, they convert it using the platform encoding (which is
> user-settable on both Windows and Unix) to a Unicode string, and
> display that.
Not quite true for Tk: Tcl's 8-bit to Unicode conversion expects
UTF-8. When it sees a lead byte that is not followed by enough trail
bytes, the lead byte is copied as is. Naked trail bytes are also
copied as is.
Under Latin-1, the following three Python strings all result in the
same Tcl string value:
str = "åäö"
str = u"åäö".encode("utf-8")
str = u"åäö"
But under a hypothetical platform encoding where "å" looks like
a UTF-8 lead byte, and "ä" like a trail byte, this will fail (if you
think that's unlikely, feel free to replace "å" and "ä" with other
characters...).
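To make the rule concrete, here is a rough model of the conversion
described above. It is only a sketch, not Tcl's actual code: the
function name lenient_utf8_to_unicode is made up for illustration, it
is written for Python 3 (bytes vs. str), and it assumes only 2- and
3-byte sequences are decoded, with every other non-ASCII byte copied
through as the code point of the same value.

```python
def lenient_utf8_to_unicode(data: bytes) -> str:
    """Decode UTF-8, copying malformed bytes through unchanged.

    Hypothetical model of the behaviour described in this message;
    not taken from the Tcl sources.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # How many trail bytes does this byte announce?
        if 0xC0 <= b < 0xE0:
            need = 1          # lead byte of a 2-byte sequence
        elif 0xE0 <= b < 0xF0:
            need = 2          # lead byte of a 3-byte sequence
        else:
            need = 0          # ASCII, naked trail byte, or other: copy as is
        trail = data[i + 1:i + 1 + need]
        complete = (need > 0 and len(trail) == need
                    and all(0x80 <= t < 0xC0 for t in trail))
        if complete:
            # Well-formed sequence: assemble the code point.
            cp = b & (0x1F if need == 1 else 0x0F)
            for t in trail:
                cp = (cp << 6) | (t & 0x3F)
            out.append(chr(cp))
            i += 1 + need
        else:
            # Not enough trail bytes (or not a lead byte): copy the byte.
            out.append(chr(b))
            i += 1
    return "".join(out)


# Latin-1 bytes E5 E4 F6: no lead byte is followed by valid trail bytes.
latin1_bytes = b"\xe5\xe4\xf6"
# The same three characters, properly encoded as UTF-8.
utf8_bytes = "\xe5\xe4\xf6".encode("utf-8")
# Latin-1 text whose bytes happen to form a valid UTF-8 sequence (C3 A5).
ambiguous = b"\xc3\xa5"

print(lenient_utf8_to_unicode(latin1_bytes) == lenient_utf8_to_unicode(utf8_bytes))
print(len(lenient_utf8_to_unicode(ambiguous)))
```

The first two inputs come out identical, which is why the Latin-1 case
appears to work. The third shows the failure mode: two 8-bit
characters whose bytes happen to be C3 A5 collapse into a single
character.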
Cheers /F