[issue10254] unicodedata.normalize('NFC', s) regression

Fri Dec 17 02:34:52 CET 2010

Alexander Belopolsky <belopolsky at users.sourceforge.net> added the comment:

The logic suggested by Martin in msg120018 looks right to me, but the code as a whole seems unnecessarily complex.  (Also, comb1==comb may need to be changed to comb1>=comb.)  I don't understand why a linear search through the "skipped" array is needed.  At the very least, instead of adding their positions to the "skipped" list, combining characters that have already been consumed could be replaced by a non-character and skipped later.  A better algorithm should be able to avoid the whole issue of "skipping" by properly computing the length of the decomposed character.  See internalCompose() at http://www.unicode.org/reports/tr15/Normalizer.java.
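
For illustration only, here is a minimal Python sketch of a composition loop that needs no "skipped" list.  It is loosely modeled on the composition step described in UAX #15 and is not the code in unicodedata.c.  The names COMPOSE_PAIRS and compose() are hypothetical, the pair table holds just a few entries (a real one would come from the Unicode data files, minus the composition exclusions), and the input is assumed to be already decomposed and canonically ordered.

```python
import unicodedata

# Hypothetical, deliberately tiny pair table for illustration.
# A real implementation would use the full Unicode composition data.
COMPOSE_PAIRS = {
    ("a", "\u0308"): "\u00e4",       # a + diaeresis
    ("a", "\u0302"): "\u00e2",       # a + circumflex
    ("a", "\u0323"): "\u1ea1",       # a + dot below
    ("\u1ea1", "\u0302"): "\u1ead",  # (a + dot below) + circumflex
}


def compose(decomposed):
    """Compose a string that is assumed to be in canonical (NFD) order."""
    out = []            # characters kept so far
    starter_pos = None  # index in `out` of the most recent starter
    last_class = 0      # combining class of the last character appended

    for ch in decomposed:
        ch_class = unicodedata.combining(ch)
        composite = None
        if starter_pos is not None:
            composite = COMPOSE_PAIRS.get((out[starter_pos], ch))

        # ch is blocked from the starter when the last kept character has
        # a combining class >= ch_class (the comb1 >= comb condition),
        # unless nothing has been kept since the starter (last_class == 0).
        if composite is not None and (last_class == 0 or last_class < ch_class):
            # Overwrite the starter in place. The consumed character is
            # simply not appended, so nothing has to be skipped later.
            out[starter_pos] = composite
        else:
            if ch_class == 0:
                starter_pos = len(out)
            last_class = ch_class
            out.append(ch)

    return "".join(out)
```

With this table, compose("a\u0323\u0302") builds the result in two steps through the starter, while compose("a\u0305\u0308") leaves the diaeresis uncomposed because the overline before it has the same combining class.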

I'll try to come up with a patch.

----------
assignee:  -> belopolsky

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10254>
_______________________________________