[Python-checkins] r42954 - in python/trunk: Doc/lib/libunicodedata.tex Include/ucnhash.h Lib/encodings/idna.py Lib/stringprep.py Modules/unicodedata.c

"Martin v. Löwis" martin at v.loewis.de
Mon Mar 13 23:45:49 CET 2006


[as Thomas points out, this is on python-checkins, so continuing in
 English]

> Wrong, because the patch is considerably more complex than would
> have been necessary to solve the problem, and from now on several
> versions of the database will always have to be kept available,
> instead of simply providing several modules that can be loaded
> as needed.

Well, "the problem" to be solved was not merely to provide two versions
of the database, but to do so in a space-efficient way. All the effort
spent squeezing the size of the data would be wasted if the size
then doubled just because two versions of the database must
be provided.

> It will also not be possible to separate out the old versions
> without problems, so that when the database is extended with further
> fields or information, problems with synchronizing the database
> will arise.

There is no need to strip the old version. Parts of the library
rely on the old version specifically, and these parts are not going
to go away in the foreseeable future, nor is their need for
version 3.2 of the Unicode database.
IDNA is simply not going to change in that respect for several
years to come.
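
For illustration, current Python 3 releases expose the 3.2 data as
`unicodedata.ucd_3_2_0`, an object with the same methods as the module
itself. A minimal sketch of the difference, assuming U+20B5 (CEDI SIGN,
added in Unicode 4.1) as the sample character; the variable names are
only for this example:

```python
import unicodedata

# The module-level functions use the newest database shipped with Python;
# ucd_3_2_0 answers the same queries from the frozen 3.2 data.
old = unicodedata.ucd_3_2_0
ch = "\u20b5"  # CEDI SIGN, sample character assigned after Unicode 3.2

old_version = old.unidata_version         # version string of the old data
old_category = old.category(ch)           # category according to 3.2
new_category = unicodedata.category(ch)   # category according to current data
print(old_version, old_category, new_category)
```

Code that must follow Unicode 3.2 exactly, such as the IDNA and
stringprep support, can query the old object and leave everything else
on the current database.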

*If* there is a need to strip off 3.2 at some point, this is
very easily done through a slight modification to
makeunicodedata.py.

>>That is exactly the trick: they don't have to. Supporting
>>Unicode 3.2 costs only 18kB.
> 
> 
> That is indeed small.

That's because only the changed records are collected, plus a list
of characters that were unassigned in 3.2 but are defined in 4.1.
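
The idea can be sketched in a few lines of Python. This is a toy model
under stated assumptions, not the code in makeunicodedata.py: the
records are invented, `build_delta` and `lookup_old` are hypothetical
names, and every character assigned in the old version is assumed to be
assigned in the new one as well.

```python
def build_delta(old, new):
    """Describe `old` relative to `new` as (changed records, unassigned set)."""
    # Records present in both versions but with different content.
    changed = {cp: old[cp] for cp in new if cp in old and old[cp] != new[cp]}
    # Code points defined in the new version but not in the old one.
    unassigned = {cp for cp in new if cp not in old}
    return changed, unassigned


def lookup_old(cp, new, changed, unassigned):
    """Answer a query against the old version using only `new` plus the delta."""
    if cp in unassigned:
        return None              # did not exist in the old version
    if cp in changed:
        return changed[cp]       # record differs from the new version
    return new.get(cp)           # unchanged: share the new record


# Invented toy records: code point -> (name, category).
new_db = {1: ("TOY ONE", "Lu"), 2: ("TOY TWO", "Lo"), 3: ("TOY THREE", "Sc")}
old_db = {1: ("TOY ONE", "Lu"), 2: ("TOY TWO", "Mn")}

changed, unassigned = build_delta(old_db, new_db)
recovered = {cp: lookup_old(cp, new_db, changed, unassigned) for cp in new_db}
```

Only code point 2 ends up in `changed` and only code point 3 in
`unassigned`; the record for code point 1 is stored once and shared by
both versions.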

In principle, there should not be a single changed record. In practice,
a few records have changed - mostly changes to the character category.
As a matter of principle, the name of a character never changes in
Unicode (this is a promise the consortium and ISO make), and, as a
similar principle, the normalization never changes except for clear
errors.

There are only five characters for which normalization changed
between 3.2 and 4.1; I generate a C function for these.
Interestingly enough, these changes are one of the primary reasons
why some people in IETF despise the notion of updating IDNA:
This would be a change in wire protocol, with potential
security implications (i.e. it might allow for phishing). In
these cases, the potential for phishing is really minimal -
but it exists, which means proposals to update IDNA will meet
strong resistance.

It might be possible to reduce the table of changes even further,
using a three-level trie, if desired.
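
As a rough illustration of what a three-level trie means here, the
following sketch deduplicates fixed-size blocks twice. It is a generic
construction, not the table layout makeunicodedata.py generates; the
function names, shift values and sample table are assumptions chosen
for the example, and the table length must be a multiple of
1 << (shift1 + shift2).

```python
def split(table, shift):
    """Cut `table` into blocks of 1 << shift entries and share equal blocks."""
    size = 1 << shift
    seen = {}      # block contents -> block number in `data`
    index = []     # one block number per block of `table`
    data = []      # the distinct blocks, concatenated
    for start in range(0, len(table), size):
        block = tuple(table[start:start + size])
        if block not in seen:
            seen[block] = len(data) >> shift
            data.extend(block)
        index.append(seen[block])
    return index, data


def build_three_level(table, shift1, shift2):
    """Return (level1, level2, level3) tables equivalent to `table`."""
    index2, level3 = split(table, shift2)   # lowest bits select within level3
    level1, level2 = split(index2, shift1)  # compress the index once more
    return level1, level2, level3


def lookup(trie, code, shift1, shift2):
    """Fetch table[code] through the three levels."""
    level1, level2, level3 = trie
    i2 = code >> shift2
    b1 = level1[i2 >> shift1]
    b2 = level2[(b1 << shift1) + (i2 & ((1 << shift1) - 1))]
    return level3[(b2 << shift2) + (code & ((1 << shift2) - 1))]


# Sparse sample table: mostly zero ("no change"), a few non-zero entries.
table = [0] * 4096
table[5], table[1000], table[4000] = 7, 8, 9

trie = build_three_level(table, 4, 4)
total_entries = sum(len(level) for level in trie)
```

For a sparse table most blocks are identical, so the three levels
together hold far fewer entries than the flat table, at the cost of two
extra index lookups per query.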

Regards,
Martin

