[Python-Dev] Re: Moving to Unicode 3.2

Martin v. Loewis martin@v.loewis.de
24 Oct 2002 14:34:25 +0200


"M.-A. Lemburg" <mal@lemburg.com> writes:

> Still, the changes from Unicode 3.0 to 3.2 are significant (to the
> few users who actually make use of the database):
> 
> 	http://www.unicode.org/unicode/reports/tr28/
> 
> Looking at a diff of the 3.0 and the 3.2 data files you can
> find quite a few additions, changes in categories and several
> removals of numeric interpretations of code points. Most
> important is, of course, that 3.2 actually assigns code points
> outside the 16-bit Unicode range.

I still can't see why adding code points beyond the BMP is the most
important part of this change. If anything, the changes to categories
might affect applications. In all these cases, I'll assume that the
more recent database is more correct, so any change in behaviour is
likely for the better.
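
To see what an application actually observes from the database, here is
a minimal sketch. It assumes a Python 3 interpreter and uses only the
standard unicodedata module; the helper name describe() is made up for
illustration. The values it reports are whatever the bundled database
version says, which is exactly the part that can change between
Unicode releases.

```python
import unicodedata

def describe(ch):
    """Return (general category, numeric value or None) for one character.

    Hypothetical helper: both fields come straight from the Unicode
    database bundled with the interpreter.
    """
    return unicodedata.category(ch), unicodedata.numeric(ch, None)

if __name__ == "__main__":
    # The database version the interpreter was built with.
    print("database version:", unicodedata.unidata_version)
    for ch in ("A", "5", "\u00bd"):
        print("U+%04X" % ord(ch), describe(ch))
```

An application that depends on these properties would see a difference
only for the code points whose entries changed between versions.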

> What I'm really concerned about is that Python is moving on the dev
> edge here while most other technologies are still using Unicode
> 2.0. It's nice to be able to use Python as reference implementation
> for Unicode, but the interoperability between Python and
> Java/Windows suffers from this.

What do you mean by "most other technologies"? Windows 2000+ supports
UTF-16 at the API level quite happily, OpenType supports UCS-4, XML is
defined in terms of Unicode-including-extended-planes, Unicode
normalization (mandated by all W3C recommendations) is defined in
terms of Unicode 3.1, and the character database of JDK 1.4 is based
on Unicode 3.0. I don't think there are that many technologies which
still use Unicode 2.0 (explicitly: most applications don't actually
care whether a code point is assigned or not).
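
For reference, this is how a code point outside the BMP is carried in
UTF-16, sketched in Python 3. The function name to_surrogates() is
hypothetical; the arithmetic is the standard surrogate-pair mapping and
does not depend on whether the code point is assigned in any particular
Unicode version.

```python
def to_surrogates(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair.

    Hypothetical helper for illustration; returns (high, low) as integers.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

if __name__ == "__main__":
    high, low = to_surrogates(0x10400)
    print("U+10400 -> %04X %04X" % (high, low))
    # Cross-check against the interpreter's own UTF-16 encoder.
    encoded = "\U00010400".encode("utf-16-be")
    print(encoded == bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF]))
```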

> Well, I should have written: ... while the code does not even
> fully implement Unicode 3.0.
> 
> Fortunately, you have already started working in that
> direction (adding normalization) and I'd like to thank
> you for your efforts.

While looking at normalization, I noticed an aspect that still
confuses me:

When new characters are added to Unicode, you can get additional
canonical decompositions. Applying the composition algorithm
mechanically, that could cause certain normal forms to change
depending on the Unicode version. Therefore, TR#15 defines a
composition version of Unicode to be used in normalization, see

http://www.unicode.org/unicode/reports/tr15/#Versioning

This version is defined to be Unicode 3.1.0.

In turn, it seems that algorithms that perform normalization based on
older databases are incorrect (algorithms based on newer databases
must take the exclusion tables into account, which my patch does).
[This is the confusing part: how wrong are normalization algorithms
based on, say, Unicode 3.0?]

So for this specific aspect, updating the Unicode database was really
necessary.
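
The effect of the exclusion tables can be observed with a small sketch.
It assumes a Python 3 interpreter whose unicodedata module provides
normalize(); the wrappers nfc() and nfd() are hypothetical names. The
example relies on U+0958 (DEVANAGARI LETTER QA) being listed in the
composition exclusions, so that it decomposes but is never recomposed.

```python
import unicodedata

def nfc(text):
    """Canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", text)

def nfd(text):
    """Canonical decomposition only."""
    return unicodedata.normalize("NFD", text)

if __name__ == "__main__":
    # An ordinary pair: "e" + COMBINING ACUTE ACCENT composes to U+00E9.
    print(nfc("e\u0301") == "\u00e9")
    # An excluded character: U+0958 decomposes to U+0915 U+093C ...
    print(nfd("\u0958") == "\u0915\u093c")
    # ... and NFC leaves it decomposed, because composing back to
    # U+0958 is excluded.
    print(nfc("\u0958") == "\u0915\u093c")
```

A composition step that ignored the exclusions would map U+0915 U+093C
back to U+0958 and so produce a different NFC result for the same
input.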

>  > Apparently, this impression is wrong. Can you please give precise
>  > instructions what constitutes "such a change"?
> 
> I consider changes in the design or additions that affect
> the design such a change.

That is still unclear. What's an addition? I did not really add
anything - I changed what was already there. Nor do I think I changed
the design.

Regards,
Martin