Re: Moving to Unicode 3.2 ([Python-checkins] python/dist/src/Objects unicodectype.c,2.11,2.12 unicodetype_db.h,1.4,1.5)

loewis@users.sourceforge.net wrote:
I haven't seen any messages about this on python-dev. Did I miss something ? The switch from Unicode 3.0 is a big one since 3.2 introduces non-BMP character points for the first time. I also don't think that it is a good idea to ship the Unicode 3.2 database while the code behaves as defined in Unicode 3.0. And last not least, I'd like to be asked before you make such changes. Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

"M.-A. Lemburg" <mal@lemburg.com> writes:
I haven't seen any messages about this on python-dev. Did I miss something?
No. For a change like this, I did not think consultation was necessary.
The switch from Unicode 3.0 is a big one since 3.2 introduces non-BMP code points for the first time.
I disagree; it's a small change. Just look at the patch itself: apart from the (admittedly large) generated data, there were very few actual changes to the source code; changing a few limits was sufficient. Since there are no backwards-compatibility issues, and no design choices (apart from the choice of updating the database at all), this is a straightforward change.
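The "few limits" in question are the bounds on what counts as a valid code point. A minimal sketch of probing those limits, using today's `unicodedata` and `sys` modules rather than the 2.x code under discussion (the specific test character is my choice, not from the patch):

```python
import sys
import unicodedata

# The highest code point the interpreter can represent
# (always 0x10FFFF on modern Python).
print(hex(sys.maxunicode))

# A supplementary-plane (non-BMP) character: U+10330 GOTHic LETTER AHSA.
# Once the database assigns it, it gets a real general category.
ch = chr(0x10330)
print(unicodedata.category(ch))  # 'Lo' (Letter, other)
```

If the database did not cover the supplementary planes, such a character would fall back to the unassigned category `Cn` instead.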
I also don't think that it is a good idea to ship the Unicode 3.2 database while the code behaves as defined in Unicode 3.0.
Can you please elaborate? What code behaves as defined in Unicode 3.0 that is incompatible with the Unicode 3.2 database?
And last but not least, I'd like to be asked before you make such changes.
I find this quite a possessive view, and I would prefer that you bring up technical arguments instead of procedural ones, but OK... I was under the impression that I could apply my own professional judgement when deciding which patches to apply without consultation, in which cases to ask on python-dev, and when to submit a patch to SF. Apparently, this impression is wrong. Can you please give precise instructions on what constitutes "such a change"? Also, should I back this change out? Regards, Martin

Martin v. Loewis wrote:
I saw that :-)
Which underlines Fredrik's good design of the Unicode database. Still, the changes from Unicode 3.0 to 3.2 are significant (to the few users who actually make use of the database): http://www.unicode.org/unicode/reports/tr28/

Looking at a diff of the 3.0 and 3.2 data files, you can find quite a few additions, changes in categories, and several removals of numeric interpretations of code points. Most important, of course, is that 3.2 actually assigns code points outside the 16-bit Unicode range.

What I'm really concerned about is that Python is moving on the leading edge here while most other technologies are still using Unicode 2.0. It's nice to be able to use Python as a reference implementation for Unicode, but the interoperability between Python and Java/Windows suffers from this.
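The kinds of differences mentioned above (categories, numeric interpretations) can be probed directly, since the shipped database is exposed through the `unicodedata` module. A sketch, with an example character of my own choosing, not one from the 3.0-to-3.2 diff:

```python
import unicodedata

# Which UnicodeData.txt version this interpreter ships.
print(unicodedata.unidata_version)

# The numeric interpretation of a code point comes from the database:
half = "\u00bd"  # VULGAR FRACTION ONE HALF
print(unicodedata.numeric(half))   # 0.5
print(unicodedata.category(half))  # 'No' (Number, other)

# A code point whose numeric interpretation was removed between database
# versions would raise ValueError here instead of returning a value.
try:
    unicodedata.numeric("x")
except ValueError:
    print("no numeric value")
```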
Well, I should have written: ... while the code does not even fully implement Unicode 3.0. Fortunately, you have already started working in that direction (adding normalization) and I'd like to thank you for your efforts.
I am not trying to possess anything here. This is about managing the Unicode code base and I'm still under the impression that I'm the one in charge here.
I consider changes to the design, or additions that affect the design, to be such changes.
Also, should I back this change out?
No, let's first find out what the consequences of this change are and then decide whether it's a good idea or not.

--
Marc-Andre Lemburg

"M.-A. Lemburg" <mal@lemburg.com> writes:
I still can't see why adding code points beyond the BMP is the most important part of this change. If anything, the changes to categories might affect applications. In all these cases, I'll assume that the more recent database is more correct, so any change in behaviour is likely for the better.
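Category changes are exactly the kind of thing applications can observe, because string predicates like `str.isdigit()` and `str.isnumeric()` are driven by the database. A small sketch (the sample characters are illustrative, not ones whose categories changed between 3.0 and 3.2):

```python
import unicodedata

# LATIN CAPITAL LETTER A, ARABIC-INDIC DIGIT THREE, CIRCLED DIGIT ONE
for ch in ("A", "\u0663", "\u2460"):
    print(hex(ord(ch)),
          unicodedata.category(ch),  # general category from the database
          ch.isdigit(),              # depends on the Numeric_Type property
          ch.isnumeric())
```

If a database update recategorized a character, say from letter to number, every application relying on these predicates would see the new behaviour.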
What do you mean by "most other technologies"? Windows 2000 and later support UTF-16 at the API level quite happily, OpenType supports UCS-4, XML is defined in terms of Unicode including the extended planes, Unicode normalization (mandated by all W3C recommendations) is defined in terms of Unicode 3.1, and the character database of JDK 1.4 is based on Unicode 3.0. I don't think there are that many technologies which still use Unicode 2.0 (explicitly: most applications don't actually care whether a code point is assigned or not).
While looking at normalization, I noticed an aspect that still confuses me: when new characters are added to Unicode, you can get additional canonical decompositions. Applying the composition algorithm mechanically, that could cause certain normal forms to change depending on the Unicode version. Therefore, TR#15 defines a composition version of Unicode to be used in normalization; see http://www.unicode.org/unicode/reports/tr15/#Versioning This version is defined to be Unicode 3.1.0.

In turn, it seems that algorithms that perform normalization based on older databases are incorrect (algorithms based on newer databases must take the exclusion tables into account, which my patch does). [This is the confusing part: how wrong are normalization algorithms based on, say, Unicode 3.0?]

So for this specific aspect, updating the Unicode database was really necessary.
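The normalization work discussed here eventually surfaced as `unicodedata.normalize()`; a sketch of the behaviour in question, assuming a Python version that includes that function:

```python
import unicodedata

s = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
nfc = unicodedata.normalize("NFC", s)
nfd = unicodedata.normalize("NFD", nfc)
print(nfc == "\u00e9")  # True: composed into a single code point
print(nfd == s)         # True: decomposed back to base + combining mark

# Why the exclusion tables matter: OHM SIGN has a singleton canonical
# decomposition to GREEK CAPITAL LETTER OMEGA, and singletons are
# excluded from recomposition, so NFC does not map Omega back to Ohm.
print(unicodedata.normalize("NFC", "\u2126"))  # GREEK CAPITAL LETTER OMEGA
```

A normalizer built on a database without the exclusion data would recompose such pairs and produce a different, non-conformant normal form, which is the interoperability risk TR#15's composition version is meant to pin down.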
That is still unclear. What's an addition? I did not really add anything - I changed what was already there. Nor do I think I changed the design. Regards, Martin

"M.-A. Lemburg" <mal@lemburg.com> writes:
I haven't seen any messages about this on python-dev. Did I miss something ?
No. For a change like this, I did not think consultation was necessary.
The switch from Unicode 3.0 is a big one since 3.2 introduces non-BMP character points for the first time.
I disagree; it's a small change. Just look at the patch itself: apart from the (considerably large) generated data, there were very few actual changes to source code: Changing a few limits was sufficient. Since there are no backwards compatibility issues, and no design choices (apart from the choice of updating the database at all), this is a straight-forward change.
I also don't think that it is a good idea to ship the Unicode 3.2 database while the code behaves as defined in Unicode 3.0.
Can you please elaborate? What code behaves as defined in Unicode 3.0 that is incompatible with the Unicode 3.2 database?
And last not least, I'd like to be asked before you make such changes.
I find this quite a possessive view, and I would prefer if you bring up technical arguments instead of procedural ones, but ok... I was under the impression that I can apply my own professional judgement when deciding what patches to apply without consultation, in what cases to ask on python-dev, and when to submit a patch to SF. Apparently, this impression is wrong. Can you please give precise instructions what constitutes "such a change"? Also, should I back this change out? Regards, Martin

Martin v. Loewis wrote:
I saw that :-)
Which underlines Fredrik's good design of the Unicode database. Still, the changes from Unicode 3.0 to 3.2 are significant (to the few users who actually make use of the database): http://www.unicode.org/unicode/reports/tr28/ Looking at a diff of the 3.0 and the 3.2 data files you can find quite a few additions, changes in categories and several removals of numeric interpretations of code points. Most important is, of course, that 3.2 actually assing code points outside the 16-bit Unicode range. What I'm really concerned about is that Python is moving on the dev edge here while most other technologies are still using Unicode 2.0. It's nice to be able to use Python as reference implementation for Unicode, but the interoperability between Python and Java/Windows suffers from this.
Well, I should have written: ... while the code does not even fully implement Unicode 3.0. Fortunately, you have already started working in that direction (adding normalization) and I'd like to thank you for your efforts.
I am not trying to possess anything here. This is about managing the Unicode code base and I'm still under the impression that I'm the one in charge here.
I consider changes in the design or additions that affect the design such a change.
Also, should I back this change out?
No, let's first find out what the consequences of this change are and then decide whether it's a good idea or not. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH _______________________________________________________________________ eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,... Python Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/

"M.-A. Lemburg" <mal@lemburg.com> writes:
I still can't see why adding code points beyond the BMP is most important part of this change. If anything, the changes to categories might affect applications. In all these cases, I'll assume that the more recent database is more correct, so any change in behaviour is likely for the better.
What do you mean "most other technologies"? Windows 2000+ support UTF-16 at the API quite happily, OpenType supports UCS-4, XML is defined in terms of Unicode-including-extended-planes, Unicode normalization (mandated by all W3C recommendations) is defined in terms of Unicode 3.1, the character database of JDK 1.4 is based on Unicode 3.0. I don't think there are that many technologies which still use Unicode 2.0 (explicitly: most applications don't actually care whether a code point is assigned or not).
While looking at normalization, I noticed an aspect that still confuses me: When new characters are added to Unicode, you can get additional canonical decompositions. Applying the composition algorithm mechanically, that could cause certain normal forms to change depending on the Unicode version. Therefore, TR#15 defines a composition version of Unicode to be used in normalization, see http://www.unicode.org/unicode/reports/tr15/#Versioning This version is defined to be Unicode 3.1.0. In turn, it seems that algorithms that perform normalization based on older databases are incorrect (algorithms based on newer databases must take the exclusion tables into account, which my patch does) [this is the confusing part: how wrong are normalization algorithms based on, say, Unicode 3.0?] So for this specific aspect, updating the Unicode database was really necessary.
That is still unclear. What's an addition? I did not really add anything - I changed what was already there. Nor do I think I changed the design. Regards, Martin
participants (2):
- M.-A. Lemburg
- martin@v.loewis.de