[I18n-sig] Re: [Python-Dev] Unicode debate

Tim Peters tim_one@email.msn.com
Wed, 3 May 2000 01:47:37 -0400


[Moshe Zadka]
> ...
> I'd much prefer Python to reflect a fundamental truth about Unicode,
> which at least makes sure binary-goop can pass through Unicode and
> remain unharmed, than to reflect a nasty problem with UTF-8 (not
> everything is legal).

Then you don't want Unicode at all, Moshe.  All the official encoding
schemes for Unicode 3.0 suffer illegal byte sequences.  For example, 0xffff
is illegal in UTF-16 (whether BE or LE).  This isn't merely a matter of
Unicode not yet having assigned a character to this position:  the
standard explicitly makes this sequence illegal and guarantees it will
always be illegal!  The other place this comes up is with surrogates, where
what's legal depends on both parts of a surrogate pair; again, the
illegalities here are guaranteed illegal for all time.  UCS-4 is the
closest any Unicode encoding gets to binary transparency, but even there
the length of a thing is constrained to be a multiple of 4 bytes.  Unicode
and binary goop will never coexist peacefully.
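
A minimal sketch of the point, assuming a present-day Python 3 and its
built-in strict codecs (not the interpreter under discussion in this
thread).  The helper name try_decode and the sample byte strings are
made up for illustration.  It shows the surrogate case and the UCS-4
length case; the 0xffff case is left out because current CPython codecs
happen to accept that code unit, so they can't be used to demonstrate it.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are illegal
    in the given encoding (hypothetical helper, for illustration)."""
    try:
        return data.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError:
        return None


# A high surrogate (0xD800) followed by an ordinary 'A' (0x0041):
# legality depends on both halves of the pair, and this pair is bad.
lone_surrogate = b"\x00\xd8\x41\x00"

# The same high surrogate followed by a low surrogate (0xDC00):
# together they encode the single character U+10000.
good_pair = b"\x00\xd8\x00\xdc"

# Three bytes can never be a whole number of 4-byte UCS-4 units.
short_ucs4 = b"\x41\x00\x00"

# 0xFF can never appear anywhere in well-formed UTF-8.
bad_utf8 = b"\xff"

results = {
    "utf-16 lone surrogate": try_decode(lone_surrogate, "utf-16-le"),
    "utf-16 surrogate pair": try_decode(good_pair, "utf-16-le"),
    "ucs-4 with 3 bytes": try_decode(short_ucs4, "utf-32-le"),
    "utf-8 with 0xff": try_decode(bad_utf8, "utf-8"),
}

for label, text in results.items():
    print(label, "->", "illegal" if text is None else repr(text))
```

Only the well-formed surrogate pair decodes; the other three byte
strings are rejected, which is the sense in which arbitrary binary data
can't ride through any of these encodings unharmed.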