[issue5200] unicode.normalize gives wrong result for some characters

Tue Feb 10 11:45:57 CET 2009

New submission from Peter Landgren <peter.talken at telia.com>:

If any of the Swedish characters "åäöÅÄÖ" are input to
unicode.normalize(form, ustr) with form = "NFD" or "NFKD", the result
will be "aaoAAO". "åäöÅÄÖ" are normal characters and should be unchanged
after normalization. They are related to "aaoAAO" only for historical
reasons, not in the modern languages. It is a common
misinterpretation that the dots and ring above them are diacritical
marks, but those letters should behave like the (Danish)
"Ø", which is normalized correctly.
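
The behaviour can be reproduced with a short sketch. It assumes the
function in question is unicodedata.normalize from the standard library
(the title abbreviates the module name), and it writes the letters as
escape sequences so that it does not depend on the source encoding. The
variable names are only illustrative. Note that the decomposed string
holds each base letter followed by a combining mark (U+030A or U+0308),
so it shows up as plain "aaoAAO" only where those marks are dropped or
not rendered.

```python
import unicodedata

# The six Swedish letters from the report, written as escapes.
swedish = u"\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"  # åäöÅÄÖ
danish = u"\u00d8"  # Ø

# NFD splits each letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", swedish)

# Dropping the combining marks leaves only the base letters.
base_only = u"".join(c for c in decomposed if not unicodedata.combining(c))

# NFC puts the decomposed sequence back together.
recomposed = unicodedata.normalize("NFC", decomposed)

# The Danish letter has no canonical decomposition, so NFD leaves it alone.
danish_nfd = unicodedata.normalize("NFD", danish)

print(repr(decomposed))
print(repr(base_only))
print(recomposed == swedish)
print(danish_nfd == danish)
```

If the original letters are needed afterwards, normalizing the result
with form "NFC" restores them, as the third print shows.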

From Wikipedia:
Å is often perceived as an A with a ring, interpreting the ring as a
diacritical mark. However, in the languages that use it, the ring is not
considered a diacritic but part of the letter.
The letter Ö in the Swedish and Icelandic alphabets historically arises
from the Germanic umlaut, but it is considered a separate letter from O.
See http://en.wikipedia.org/wiki/%C3%85

I think this is probably impossible to solve, as these letters cannot be
told apart from an "umlaut" without knowing which language the specific
word belongs to.

----------
components: Library (Lib)
messages: 81536
nosy: PeterL
severity: normal
status: open
title: unicode.normalize gives wrong result for some characters
type: behavior
versions: Python 2.5

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue5200>
_______________________________________