[New-bugs-announce] [issue26917] Inconsistency in unicodedata.normalize()?

Armin Rigo report at bugs.python.org
Tue May 3 04:48:27 EDT 2016

New submission from Armin Rigo:

There is an apparent inconsistency in unicodedata.normalize("NFC", ...), introduced with the switch from the Unicode DB 5.1.0 to 5.2.0 (in Python 2.7).  First, please note that my knowledge of Unicode is limited, so I may be wrong and the following behavior might be perfectly correct.

>>> from unicodedata import normalize
>>> print(normalize("NFC", "---\uafb8\u11a7---").encode('utf-8'))
b'---\xea\xbe\xb8\xe1\x86\xa7---'    # i.e., the same as the input

>>> print(normalize("NFC", "---\uafb8\u11a7---\U0002f8a1").encode('utf-8'))

Note how in the second example the initial two-character part is replaced with a single character (actually the first of them).  This does not occur in the first example.  In Python 2.6, both inputs would be normalized to the single-character output.
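
One possible explanation for why the result is "the first of them" is the Hangul LV + T composition arithmetic.  The sketch below is an assumption, not code taken from the unicodedata module: the constants are the ones the Unicode Standard gives for Hangul syllables, and `compose_lv_t` with its `allow_zero_index` flag is a hypothetical helper used only for illustration.  U+AFB8 is an LV syllable and U+11A7 sits exactly at the trailing-consonant base (index 0), so a composition step that accepts index 0 would return U+AFB8 itself.

```python
# Hangul constants as given in the Unicode Standard's Hangul syllable section.
S_BASE = 0xAC00   # first precomposed syllable
T_BASE = 0x11A7   # one below the first trailing consonant (index 0 = "no trailing consonant")
T_COUNT = 28      # trailing-consonant slots per LV syllable, including the empty slot
S_COUNT = 11172   # total number of precomposed syllables


def compose_lv_t(lv, t, allow_zero_index):
    """Hypothetical helper: try to compose an LV syllable with a trailing jamo.

    allow_zero_index=True models a check of the form 0 <= t_index < T_COUNT,
    allow_zero_index=False models 0 < t_index < T_COUNT.
    Returns the composed string, or lv + t unchanged if no composition applies.
    """
    s_index = ord(lv) - S_BASE
    t_index = ord(t) - T_BASE
    is_lv = 0 <= s_index < S_COUNT and s_index % T_COUNT == 0
    lower_ok = t_index >= 0 if allow_zero_index else t_index > 0
    if is_lv and lower_ok and t_index < T_COUNT:
        return chr(ord(lv) + t_index)
    return lv + t


if __name__ == "__main__":
    # With index 0 accepted, the pair collapses to its first character.
    print(ascii(compose_lv_t("\uafb8", "\u11a7", allow_zero_index=True)))
    # With index 0 rejected, the pair is left alone.
    print(ascii(compose_lv_t("\uafb8", "\u11a7", allow_zero_index=False)))
```

This only illustrates the arithmetic; it does not by itself settle which of the two behaviors is the correct one.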

The new behavior introduced in Python 2.7 is to first do a quick-check on the string, and if this `is_normalized()` function returns 1, we know that the string should already be normalized and we return it unmodified.  However, the example "\uafb8\u11a7" shows contradictory behavior: `is_normalized()` returns 1 for it, yet actual normalization changes it.  We can see in the second example above that if, for an unrelated reason, we force `is_normalized()` to return 0 (by adding some non-normalized character elsewhere in the string), then the "\uafb8\u11a7" is changed.
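
The check described above can be turned into a small script.  This is a minimal sketch using only the public unicodedata.normalize(); `TRIGGER` and `nfc_is_context_independent` are hypothetical names, and the result for "\uafb8\u11a7" depends on the Python version and Unicode database in use, so no expected output is given for it.

```python
import unicodedata

# U+2F8A1 is a CJK compatibility ideograph with a canonical decomposition,
# so any string containing it is not in NFC and cannot take the quick-check
# shortcut.
TRIGGER = "\U0002f8a1"


def nfc_is_context_independent(fragment, separator="---"):
    """Return True if NFC of `fragment` is the same with and without TRIGGER appended.

    The ASCII separator keeps the fragment and the trigger from interacting
    during normalization, so any difference comes from the code path taken.
    """
    alone = unicodedata.normalize("NFC", fragment + separator)
    with_trigger = unicodedata.normalize("NFC", fragment + separator + TRIGGER)
    return with_trigger.startswith(alone)


if __name__ == "__main__":
    print("Unicode DB version:", unicodedata.unidata_version)
    print("consistent:", nfc_is_context_independent("---\uafb8\u11a7"))
```

A result of False would reproduce the inconsistency from the two examples above; True would mean both code paths agree on this interpreter.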

This is a bit unexpected, but I don't know if it is officially correct behavior or if the problem is a bug in `is_normalized()`.

components: Unicode
messages: 264697
nosy: arigo, ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: Inconsistency in unicodedata.normalize()?
type: behavior
versions: Python 2.7, Python 3.6

Python tracker <report at bugs.python.org>
