[I18n-sig] Re: [Python-Dev] Unicode debate

Just van Rossum just@letterror.com
Wed, 3 May 2000 07:47:07 +0100


[MAL vs. PP]
>> > FYI: Normalization is needed to make comparing Unicode
>> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
>>
>> That's a whole 'nother debate at a whole 'nother level of abstraction. I
>> think we need to get the bytes/characters level right and then we can
>> worry about display-equivalent characters (or leave that to the Python
>> programmer to figure out...).
>
>I just wanted to point out that the argument "slicing doesn't
>work with UTF-8" is moot.

And failed...

I asked two Unicode gurus I happen to know about the normalization issue
(which is indeed not relevant to the current discussion, but it's
fascinating nevertheless!).

(Sorry about the possibly wrong email encoding... "è" is u"\350", "ö" is
u"\366")

John Jenkins replied:
"""
Well, I'm not sure you want to hear the answer -- but it really depends on
what the language is attempting to do.

By and large, Unicode takes the position that "e`" should always be treated
the same as "è". This is a *semantic* equivalence -- that is, they *mean*
the same thing -- and doesn't depend on the display engine to be true.
Unicode also provides a default collation algorithm
(http://www.unicode.org/unicode/reports/tr10/).

At the same time, the standard acknowledges that in real life, string
comparison and collation are complicated, language-specific problems
requiring a lot of work and interaction with the user to do right.

From the perspective of a programming language, it would best be served IMHO
by implementing the contents of TR10 for string comparison and collation.
That would make "e`" and "è" come out as equivalent.
"""


Dave Opstad replied:
"""
Unicode talks about "canonical decomposition" in order to make it easier
to answer questions like yours. Specifically, in the Unicode 3.0
standard, rule D24 in section 3.6 (page 44) states that:

"Two character sequences are said to be canonical equivalents if their
full canonical decompositions are identical. For example, the sequences
<o, combining-diaeresis> and <ö> are canonical equivalents. Canonical
equivalence is a Unicode property. It should not be confused with
language-specific collation or matching, which may add additional
equivalencies."

So they still have language-specific differences, even if Unicode sees
them as canonically equivalent.

You might want to check this out:

http://www.unicode.org/unicode/reports/tr15/tr15-18.html

It's the latest technical report on these issues, which may help clarify
things further.
"""

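The canonical decomposition Dave quotes can be inspected directly. The
following is only a minimal sketch, assuming a present-day Python 3 with the
standard unicodedata module (which did not exist when this was written); the
variable names are made up for illustration.

```python
import unicodedata

composed = "\u00f6"  # LATIN SMALL LETTER O WITH DIAERESIS, one code point

# Decomposition mapping as recorded in the Unicode Character Database:
# a space-separated string of hexadecimal code points.
mapping = unicodedata.decomposition(composed)
print(mapping)

# NFD applies the full canonical decomposition to the whole string.
decomposed = unicodedata.normalize("NFD", composed)
print([hex(ord(ch)) for ch in decomposed])

# Both spellings decompose to <o, combining-diaeresis>, so they are
# canonical equivalents in the sense of rule D24.
print(decomposed == unicodedata.normalize("NFD", "o\u0308"))
```

This only covers canonical equivalence; the language-specific collation and
matching the quote warns about is a separate matter.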

It's very deep stuff, which seems more appropriate for an extension than
for builtin comparisons to me.
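
To illustrate what "an extension" could mean here, a minimal sketch, assuming
a present-day Python 3 and its standard unicodedata module: the builtin ==
stays a plain code point comparison, and a separate helper (the name
canonically_equal is hypothetical) normalizes first. Note that this handles
canonical equivalence only, not the TR10 collation John recommends.

```python
import unicodedata

def canonically_equal(a, b):
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e8"  # e with grave accent as a single code point
combining = "e\u0300"   # e followed by COMBINING GRAVE ACCENT

# The builtin comparison sees two different code point sequences.
print(precomposed == combining)

# The helper treats them as the same text.
print(canonically_equal(precomposed, combining))
```

The first print shows False and the second True, so the cost of
normalization is paid only by code that asks for it.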

Just