[I18n-sig] Re: [Python-Dev] Unicode debate

Just van Rossum just@letterror.com
Tue, 2 May 2000 13:34:57 +0100


At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
>Just a small note on the subject of a character being atomic
>which seems to have been forgotten by the discussing parties:
>
>Unicode itself can be understood as multi-word character
>encoding, just like UTF-8. The reason is that Unicode entities
>can be combined to produce single display characters (e.g.
>u"e"+u"\u0301" will print "é" in a Unicode aware renderer).

Erm, are you sure Unicode prescribes this behavior, for this
example? I know similar behaviors are specified for certain
languages/scripts, but I didn't know it did that for Latin.
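
For reference, a minimal sketch of what the quoted example amounts to at the
code-point level. It assumes a present-day Python 3 and its standard
`unicodedata` module, not the interpreter under discussion, and the variable
names are only illustrative. The pair stays two code points until canonical
composition (NFC) is applied explicitly:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
decomposed = "e" + "\u0301"

# NFC composes the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print(composed == "\u00e9")
```

Whether a renderer draws the uncomposed pair as one accented letter is a
separate question from what the string contains.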

>Slicing such a combined Unicode string will have the same
>effect as slicing UTF-8 data.

Not true. As Fredrik noted: no exception will be raised.
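
A small sketch of the difference, assuming present-day Python 3 semantics
(str holds code points, bytes holds encoded data); the names are only
illustrative. Slicing between a base character and its combining mark yields
a perfectly valid string, while slicing inside a UTF-8 byte sequence yields
data that can no longer be decoded:

```python
# Two code points that a renderer may show as one accented letter.
combined = "e\u0301"
head = combined[:1]          # still a valid string: just "e"

# One character, U+00E9, encoded as two UTF-8 bytes.
utf8 = "\u00e9".encode("utf-8")

try:
    utf8[:1].decode("utf-8")  # cut in the middle of the byte sequence
    truncated_decodes = True
except UnicodeDecodeError:
    truncated_decodes = False

print(repr(head), truncated_decodes)
```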

[ Speaking of exceptions,

after I sent off my previous post I realized Guido's
non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
argument can easily be turned around, backfiring at utf-8:

    Defaulting to utf-8 when going from Unicode to 8-bit and
    back only gives the *illusion* things "just work", since it
    will *silently* "work", even if utf-8 is *not* the desired
    8-bit encoding -- as shown by Fredrik's excellent "fun with
    Unicode, part 1" example. Defaulting to Latin-1 will
    warn the user *much* earlier, since it'll barf when
    converting a Unicode string that contains any character
    code > 255. So there.
]
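
To make the argument concrete, here is a sketch using explicit `encode` and
`decode` calls in present-day Python 3 as a stand-in for whatever default
conversion gets chosen (that substitution is an assumption, as are the
variable names). UTF-8 accepts any text without complaint, Latin-1 refuses
anything above code point 255, and UTF-8 bytes misread as Latin-1 decode
silently into garbage:

```python
# Sample text containing code points above 255.
text = "\u0142\u00f3d\u017a"

# Encoding to UTF-8 succeeds for any text, wanted or not.
utf8_bytes = text.encode("utf-8")

# Encoding to Latin-1 fails immediately on the first code point > 255.
try:
    text.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# Every byte value is valid Latin-1, so this misreading raises nothing.
misread = utf8_bytes.decode("latin-1")

print(latin1_failed, misread == text)
```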

>It seems that most Latin-1 proponents have single
>display characters in mind. While the same is true for
>many Unicode entities, there are quite a few cases of
>combining characters in Unicode 3.0 and the Unicode
>normalization algorithm uses these as basis for its
>work.

Still, two combining characters are two input characters for
the renderer! They may result in one *glyph*, but trust me,
that's an entirely different can of worms.

However, if you were talking about Unicode surrogates,
you'd definitely have a point. How do Java/Perl/Tcl deal with
surrogates?
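
For anyone unsure what is at stake, a minimal sketch of a surrogate pair,
assuming present-day Python 3 (where a str is indexed by code point) and its
standard "utf-16-be" codec; it says nothing about how Java, Perl or Tcl
behave. A code point outside the 16-bit range becomes two 16-bit units, so a
slice taken between them in a 16-bit representation really would split one
character:

```python
# A single code point beyond U+FFFF.
astral = "\U00010400"

# Encode as big-endian UTF-16 and split into 16-bit code units.
utf16 = astral.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# One code point, but a high surrogate followed by a low surrogate.
print(len(astral), [hex(u) for u in units])
```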

Just