[issue5200] unicode.normalize gives wrong result for some characters

Peter Landgren report at bugs.python.org
Wed Feb 11 09:24:06 CET 2009


Peter Landgren <peter.talken at telia.com> added the comment:

> Martin v. Löwis <martin at v.loewis.de> added the comment:
> > The same applies to "Å" and "A", "Ä" and "A", and "Ö" and "O",
> > which are also different letters, just as "Ø" and "O" are.
>
> Sure. And rightfully, "Å" is *not* (I repeat: not)
> normalized to "A" under NFD:
>
> py> unicodedata.normalize("NFD", u"Å")
> u'A\u030a'
>
> > Maybe not in the Unicode world, but in real life.
>
> They are different letters also in the Unicode world.
>
> > That's why I'm a little confused.
>
> I think the confusion comes from your assumption that
> normalizing "Å" produces "A". It does not. Really not.

Yes, you are right.

However, the confusion/problem shows up when normalize is used in the
application to build an alphabet and group, for example, all versions of E
(E, É, È, Ë, Ê) together under E. The first character in the result of
normalize is used to build alphabet labels for surnames:

from unicodedata import normalize

letter = normalize("NFD", surname)[0].upper()
if letter != last_letter:
    last_letter = letter
....
and this is why I get "A" when the surname begins with "Å".

This way all variations of E are grouped under "E", as intended,
but it fails for "Å", which is shown under the label "A": not the "A" at the
beginning of the alphabet, but a second "A" after "Z", where "Å", "Ä" and "Ö" come.
So the preceding sort of the surnames works correctly.
(The Swedish alphabet has 29 letters: A, B, C, ... X, Y, Z, Å, Ä, Ö)
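
To make the behaviour concrete, here is a minimal sketch of the labelling step described above. It is written for Python 3 (the thread itself uses Python 2 u'' literals), and the helper name label_for and the sample surnames are hypothetical, not taken from GRAMPS.

```python
from unicodedata import normalize

def label_for(surname):
    """Return the alphabet label: first code point of the NFD form, upper-cased."""
    return normalize("NFD", surname)[0].upper()

# Hypothetical sample surnames, written with escapes so the source encoding does not matter.
samples = ["Eriksson", "\u00c9douard", "\u00c5berg", "\u00d6berg"]
for name in samples:
    # "Édouard" lands under "E" as intended, but "Åberg" lands under "A"
    # and "Öberg" under "O", because NFD splits off the combining mark.
    print(name, "->", label_for(name))
```

Taking the first code point of the NFD form always yields the base letter, so it cannot tell an accent that should be ignored (É) from a mark that makes a separate letter (Å).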

Can you think of any solution to this conflict? 
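
One possible direction, sketched here under stated assumptions rather than as a definitive fix: keep an explicit, language-specific set of letters that must not be folded to their base letter, and only decompose everything else. The names SWEDISH_EXTRA_LETTERS and swedish_label are hypothetical, and the set of letters would have to be chosen per language. The sketch targets Python 3.

```python
from unicodedata import normalize

# Assumption: the letters treated as independent in the Swedish alphabet.
SWEDISH_EXTRA_LETTERS = frozenset("\u00c5\u00c4\u00d6")  # Å, Ä, Ö

def swedish_label(surname, keep=SWEDISH_EXTRA_LETTERS):
    """Return an alphabet label that keeps Å, Ä and Ö as their own letters."""
    # Compose first, so that "A" + U+030A becomes the single code point "Å".
    first = normalize("NFC", surname)[0].upper()
    if first in keep:
        return first
    # Otherwise fall back to the base letter, as in the original snippet.
    return normalize("NFD", first)[0]

print(swedish_label("\u00c5berg"))    # kept as its own letter
print(swedish_label("\u00c9douard"))  # folded to the base letter
```

This only affects the label; it relies on the surnames already being sorted in the right order for the language, which the existing sort is reported to do.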

u'\xd8'

u'A\u030a'

u'\xc5'

This is obviously the result of how the Unicode spec is written,
interpreting "Å" as a variation of "A", which it is not.

I have asked the Unicode people, but have not received an answer yet.

The application is GRAMPS: http://gramps-project.org/

Once again, thanks for making some of the Unicode stuff clear!
Regards,
Peter Landgren

Added file: http://bugs.python.org/file13025/unnamed

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue5200>
_______________________________________