[I18n-sig] Re: [Python-Dev] Unicode debate
Just van Rossum
Tue, 2 May 2000 13:34:57 +0100
At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
>Just a small note on the subject of a character being atomic
>which seems to have been forgotten by the discussing parties:
>Unicode itself can be understood as multi-word character
>encoding, just like UTF-8. The reason is that Unicode entities
>can be combined to produce single display characters (e.g.
>u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
Erm, are you sure Unicode prescribes this behavior, for this
example? I know similar behaviors are specified for certain
languages/scripts, but I didn't know it did that for Latin.
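
For reference, the code-point side of this example can be checked with a short sketch. It assumes a present-day Python 3 and its unicodedata module (both postdate this thread), and it says nothing about what a particular renderer will draw:

```python
import unicodedata

decomposed = "e" + "\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE

# Two code points versus one, so the strings do not compare equal as-is.
lengths = (len(decomposed), len(precomposed))
equal_raw = decomposed == precomposed

# Canonical composition (NFC) maps the pair onto the precomposed character.
equal_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

Here lengths is (2, 1), equal_raw is False and equal_nfc is True: the standard treats the two spellings as canonically equivalent, while the string still holds two separate items.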
>Slicing such a combined Unicode string will have the same
>effect as slicing UTF-8 data.
Not true. As Fredrik noted: no exception will be raised.
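
A minimal sketch of the difference, again assuming present-day Python 3 semantics rather than the implementation under discussion: slicing between a base character and its combining mark yields a valid (if altered) string, whereas a UTF-8 byte string cut inside a multi-byte sequence fails as soon as it is decoded.

```python
decomposed = "e" + "\u0301"            # two code points
base_only = decomposed[:1]             # "e": the accent is gone, but no error

utf8_bytes = "\u00e9".encode("utf-8")  # two bytes: 0xC3 0xA9
truncated = utf8_bytes[:1]             # 0xC3 alone is an incomplete sequence

try:
    truncated.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True               # decoding the cut bytes raises
```

The slice of the Unicode string loses information silently; the slice of the UTF-8 data produces bytes that are no longer valid UTF-8 at all.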
[ Speaking of exceptions,
after I sent off my previous post I realized Guido's
argument can easily be turned around, backfiring at utf-8:
Defaulting to utf-8 when going from Unicode to 8-bit and
back only gives the *illusion* things "just work", since it
will *silently* "work", even if utf-8 is *not* the desired
8-bit encoding -- as shown by Fredrik's excellent "fun with
Unicode, part 1" example. Defaulting to Latin-1 will
warn the user *much* earlier, since it'll barf when
converting a Unicode string that contains any character
code > 255. So there. ]
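
The point can be illustrated with explicit codec calls. This is only a sketch under the assumption of present-day Python 3 codecs; it does not reproduce the implicit default-encoding machinery being debated here, and the sample string is made up for illustration.

```python
text = "\u20ac100"   # EURO SIGN (U+20AC) followed by ASCII digits

# UTF-8 can represent every code point, so encoding never complains...
utf8_bytes = text.encode("utf-8")

# ...even when the consumer then reads the bytes as Latin-1: wrong text, no error.
misread = utf8_bytes.decode("latin-1")

# Latin-1 cannot represent U+20AC, so the problem surfaces immediately.
try:
    text.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
```

With UTF-8 the mismatch only shows up later as garbled output (misread differs from text); with Latin-1 the conversion itself raises.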
>It seems that most Latin-1 proponents seem to have single
>display characters in mind. While the same is true for
>many Unicode entities, there are quite a few cases of
>combining characters in Unicode 3.0 and the Unicode
>normalization algorithm uses these as basis for its
Still, two combining characters are two input characters for
the renderer! They may result in one *glyph*, but trust me,
that's an entirely different can of worms.
However, if you'd be talking about Unicode surrogates,
you'd definitely have a point. How do Java/Perl/Tcl deal with