[issue5200] unicode.normalize gives wrong result for some characters

Martin v. Löwis report at bugs.python.org
Wed Feb 11 20:54:26 CET 2009

Martin v. Löwis <martin at v.loewis.de> added the comment:

> The È... comes from French surnames and our French developer wants to group all versions 
> of E together. The É... can be found in French surnames in Sweden as well as in Germany.
> The program, GRAMPS is a genealogy program used in about 20 languages, so there is no 
> preferred language.

In that case, I think you'll find that you have to think much harder
about collation. If you assume that Unicode code point order gives the
right collation, I predict it will be wrong in many cases.
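
To illustrate, here is a minimal sketch (Python 3 syntax) of what plain
code point order does to a few names, next to a naive key that strips
combining marks after NFD normalization. The sample names and the
strip_accents helper are made up for this example; they are not taken
from GRAMPS. The accent-stripping key does group È and É with E, but it
is only one possible policy and is not correct collation for every
language.

```python
import unicodedata

# Sample names chosen for illustration only.
names = ["Zola", "Émile", "Eder", "Èbre"]

# Plain sorted() compares code points, so È (U+00C8) and É (U+00C9)
# land after Z (U+005A).
by_code_point = sorted(names)


def strip_accents(text):
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Naive "group all versions of E together" ordering.
grouped = sorted(names, key=lambda name: strip_accents(name).casefold())

print(by_code_point)
print(grouped)
```

The same key would also fold Å, Ä and Ö into A and O, which is not what
a Swedish reader expects, so it cannot serve as a language-neutral
answer.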

For example, it appears that Croatian puts Dž as a single letter between
D and Đ.
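
A rough sketch of what handling this by hand would involve: a sort key
that splits a word into letters of the Croatian alphabet, treating the
digraphs as single letters. Lj and Nj are digraph letters in the same
alphabet; they are not mentioned above but show the effect more clearly.
The croatian_key helper and the sample names are hypothetical, and the
sketch ignores real-world complications such as the precomposed digraph
characters and words where d and ž merely happen to be adjacent.

```python
# Croatian Latin alphabet; the digraphs count as single letters.
CROATIAN_ALPHABET = [
    "a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f", "g", "h", "i", "j",
    "k", "l", "lj", "m", "n", "nj", "o", "p", "r", "s", "š", "t", "u", "v",
    "z", "ž",
]
RANK = {letter: i for i, letter in enumerate(CROATIAN_ALPHABET)}


def croatian_key(word):
    """Hypothetical helper: map a word to a list of alphabet ranks."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            # Greedily consume a digraph (dž, lj, nj) as one letter.
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after it, by code point.
            key.append(RANK.get(text[i], len(RANK) + ord(text[i])))
            i += 1
    return key


# Sample names chosen for illustration only.
names = ["Ljubić", "Lovrić", "Đurić", "Džamonja", "Dunja", "Čavić", "Cvitan"]

print(sorted(names))                    # code point order
print(sorted(names, key=croatian_key))  # alphabet-aware order
```

In code point order Čavić and Đurić end up after every ASCII initial, and
Ljubić sorts before Lovrić; with the alphabet-aware key Čavić follows
Cvitan, Džamonja sits between Dunja and Đurić, and Ljubić follows Lovrić.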

> I think we have found a solution that can handle most cases.
> We treat surnames beginning with "ÅÄÖ" special. I don't think that there are many surnames 
> outside the Nordic countries that starts with any of these three letters.

Surnames starting with Ö also seem to be common in Turkish (Öksüz,
Ölcüm, Önal, ..., taken from the Berlin phonebook), and Turkish puts Ö
after O. Hungarian also uses Ö and Ü (as well as Ó, Ú, Ő, Ű), but I
don't know how common they are as first letters of surnames.
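
To make the conflict concrete, here is a small sketch that sorts the same
names with a Swedish and a Turkish alphabet table. The make_key helper
and the sample names are assumptions made for this example. A real
program would more likely rely on locale.strxfrm or a collation library
than on hand-written tables, and the sketch deliberately avoids names
containing I, whose lowercase form differs in Turkish.

```python
# Alphabet tables: Swedish puts å, ä, ö after z; Turkish puts ö after o.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
TURKISH = "abcçdefgğhıijklmnoöprsştuüvyz"


def make_key(alphabet):
    """Hypothetical helper: build a sort key from an alphabet string."""
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(name):
        # Characters outside the alphabet sort after it, by code point.
        return [rank.get(ch, len(rank) + ord(ch)) for ch in name.lower()]

    return key


# Sample names chosen for illustration only.
names = ["Öksüz", "Olsen", "Önal", "Persson", "Zetterberg"]

swedish_order = sorted(names, key=make_key(SWEDISH))
turkish_order = sorted(names, key=make_key(TURKISH))

print(swedish_order)
print(turkish_order)
```

With the Swedish table both Ö names come after Zetterberg; with the
Turkish table they come directly after Olsen. No rule keyed only on the
first letter can satisfy both orders at once.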

Python tracker <report at bugs.python.org>
