[Python-Dev] Tcl and Unicode

Fredrik Lundh <effbot@telia.com>
Sun, 8 Oct 2000 13:04:50 +0200


guido:
> > This *should* be correct because Tcl/Tk always uses UTF-8 internally.
> > (Even though it is "lenient" when receiving strings -- if a sequence
> > of characters has no valid Unicode representation, it appears to fall
> > back to Latin-1; I don't know the details of this algorithm.)

Tcl/Tk uses a 16-bit (UCS-2) unicode string type internally, but
their 8-bit strings use UTF-8.

When converting from external 8-bit strings to unicode, they
convert valid UTF-8 sequences to unicode characters just like
Python, but "a lead-byte not followed by enough trail-bytes
represents itself." (in other words, it's cast from an unsigned
char to an unsigned short).
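
As a rough illustration of the rule above, here is a minimal sketch
in modern Python 3. It is not Tcl's actual code: the function name
lenient_utf8_decode is made up for this example, it only handles
sequences of up to three bytes (enough for 16-bit characters), and it
does not reject overlong forms or surrogates.

```python
def lenient_utf8_decode(data: bytes) -> str:
    """Decode UTF-8, letting any byte that does not start a complete
    multi-byte sequence stand for itself (code point == byte value)."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        # Number of trail bytes the lead byte asks for (0 for anything else).
        if 0xC0 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        else:
            need = 0
        trail = data[i + 1:i + 1 + need]
        if need and len(trail) == need and all(0x80 <= t <= 0xBF for t in trail):
            # Complete sequence: combine the payload bits.
            cp = b & (0x1F if need == 1 else 0x0F)
            for t in trail:
                cp = (cp << 6) | (t & 0x3F)
            out.append(chr(cp))
            i += 1 + need
        else:
            # Not enough valid trail bytes: the byte "represents itself".
            out.append(chr(b))
            i += 1
    return "".join(out)
```

With this sketch, well-formed UTF-8 input and a typical Latin-1 string
such as b"caf\xe9" both come out as the text the sender most likely
meant, which matches the "lenient" behaviour described above.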

And the chance that any reasonable Latin-1 string would contain
a UTF-8 lead byte followed by the right number of UTF-8 trail
bytes is close to zero...

(in case anyone wonders, Python's codec thinks that "close
to zero" isn't good enough, so it raises an exception instead)
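
For comparison, a small sketch of the strict behaviour, written for
current Python 3 rather than the interpreter of the time (the variable
names are just for illustration). It also shows the rare kind of
Latin-1 string that happens to be valid UTF-8, which is the case a
lenient decoder would silently read the other way.

```python
# A typical Latin-1 string: the final byte 0xE9 looks like a UTF-8
# lead byte, but no trail bytes follow it.
latin1_bytes = "caf\xe9".encode("latin-1")   # b'caf\xe9'

try:
    latin1_bytes.decode("utf-8")
    strict_outcome = "decoded"
except UnicodeDecodeError:
    strict_outcome = "UnicodeDecodeError"

# The rare collision: Latin-1 text that is also well-formed UTF-8.
ambiguous = "\xc3\xa9".encode("latin-1")     # b'\xc3\xa9'
as_utf8 = ambiguous.decode("utf-8")          # one character, U+00E9
as_latin1 = ambiguous.decode("latin-1")      # two characters, U+00C3 U+00A9
```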

tim:
> Dunno, but wouldn't be surprised if they had a notion of default encoding,
> and that it simply appears to be Latin-1 to us because American Windows uses
> a superset of Latin-1.

They have a system encoding, but it's not used here -- it's just
that Latin-1 is a subset of Unicode...
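
The subset relationship can be checked in a few lines of Python 3
(again only a sketch, with made-up variable names): decoding a byte as
Latin-1 gives the Unicode code point with the same numeric value,
which is exactly what widening an unsigned char to an unsigned short
does.

```python
# Every possible byte value, decoded as Latin-1.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")

# Each resulting character has a code point equal to the original byte.
code_points = [ord(ch) for ch in text]
round_trip = text.encode("latin-1")
```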

</F>