[Python-Dev] Tcl and Unicode
Fredrik Lundh <effbot@telia.com>
Sun, 8 Oct 2000 13:04:50 +0200
guido:
> > This *should* be correct because Tcl/Tk always uses UTF-8 internally.
> > (Even though it is "lenient" when receiving strings -- if a sequence
> > of characters has no valid Unicode representation, it appears to fall
> > back to Latin-1; I don't know the details of this algorithm.)
Tcl/Tk uses a 16-bit (UCS-2) unicode string type internally, but
their 8-bit strings use UTF-8.
When converting from external 8-bit strings to unicode, they
convert valid UTF-8 sequences to unicode characters just like
Python, but "a lead-byte not followed by enough trail-bytes
represents itself." (in other words, it's cast from an unsigned
char to an unsigned short).
And the chance that any reasonable Latin-1 string would contain
a UTF-8 lead byte followed by the right number of UTF-8 trail
bytes is close to zero...
(in case anyone wonders, Python's codec thinks that "close
to zero" isn't good enough, so it raises an exception instead)
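To make the difference concrete, here is a rough sketch of the lenient rule in Python. It is not Tcl's actual code: the name tcl_like_decode is made up for illustration, and it only handles 1- to 3-byte sequences (enough for a 16-bit string type), passing every other byte through as itself.

```python
def tcl_like_decode(data: bytes) -> str:
    """Lenient UTF-8 decode: a lead byte that is not followed by
    enough trail bytes represents itself (hypothetical helper)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0xC0 <= b < 0xE0:
            need = 1              # lead byte of a 2-byte sequence
        elif 0xE0 <= b < 0xF0:
            need = 2              # lead byte of a 3-byte sequence
        else:
            need = 0              # ASCII, stray trail byte, or anything else
        trail = data[i + 1:i + 1 + need]
        if need and len(trail) == need and all(0x80 <= t < 0xC0 for t in trail):
            # Valid sequence: combine the payload bits of lead and trail bytes.
            cp = b & (0x1F if need == 1 else 0x0F)
            for t in trail:
                cp = (cp << 6) | (t & 0x3F)
            out.append(chr(cp))
            i += 1 + need
        else:
            # Fallback: the byte value becomes the code point unchanged.
            out.append(chr(b))
            i += 1
    return "".join(out)


latin1_bytes = "café".encode("latin-1")      # b'caf\xe9'
lenient = tcl_like_decode(latin1_bytes)      # 0xE9 has no trail bytes, so it stays 'é'

try:
    latin1_bytes.decode("utf-8")             # Python's strict codec
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The "close to zero" case is a Latin-1 string that happens to look like valid UTF-8: the two Latin-1 characters "Ã©" are the bytes C3 A9, which the lenient rule silently turns into a single "é".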
tim:
> Dunno, but wouldn't be surprised if they had a notion of default encoding,
> and that it simply appears to be Latin-1 to us because American Windows uses
> a superset of Latin-1.
They have a system encoding, but it's not used here -- it's just
that Latin-1 is a subset of Unicode...
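A quick way to check the subset claim from Python (a small illustration, not anything Tcl does): decoding every possible byte value as Latin-1 gives code points equal to the byte values, which is why a plain widening cast is enough.

```python
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")

# Each Latin-1 byte maps to the Unicode code point with the same number.
code_points = [ord(ch) for ch in text]
same_values = code_points == list(range(256))
```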
</F>