How to waste computer memory?
Chris Angelico
rosuav at gmail.com
Sat Mar 19 10:14:41 EDT 2016
On Sat, Mar 19, 2016 at 11:42 PM, Marko Rauhamaa <marko at pacujo.net> wrote:
>> The problem is not so much the existence of combining characters, but that
>> *some* but not all accented characters are available in two forms: a
>> composed single code point, and a decomposed pair of code points.
>
> Also, is an a with ring on top and another ring on bottom the same
> character as an a with ring on bottom and another ring on top?
Unicode has an answer for this one. It's called normalization; it
doesn't go quite as far as I thought, but it does at least settle
this exact question.
>>> import unicodedata
>>> print(ascii(unicodedata.normalize("NFC","a\u0325\u030a")))
'\u1e01\u030a'
>>> print(ascii(unicodedata.normalize("NFC","a\u030a\u0325")))
'\u1e01\u030a'
>>> print(ascii(unicodedata.normalize("NFD","a\u0325\u030a")))
'a\u0325\u030a'
>>> print(ascii(unicodedata.normalize("NFD","a\u030a\u0325")))
'a\u0325\u030a'
So yes, they are the same combined character. Whether you ask for the
composed form or the decomposed form, you get the exact same sequence
of codepoints from either initial ordering - either this:
'a' LATIN SMALL LETTER A
'\u0325' COMBINING RING BELOW
'\u030a' COMBINING RING ABOVE
or this:
'\u1e01' LATIN SMALL LETTER A WITH RING BELOW
'\u030a' COMBINING RING ABOVE
but never this:
'\xe5' LATIN SMALL LETTER A WITH RING ABOVE
'\u0325' COMBINING RING BELOW
which will normalize to either of the above.
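The third spelling is not shown in the session above, so here is a small check that all three collapse to the same thing. This is only a sketch; the names `spellings`, `nfc_forms` and `nfd_forms` are made up for the illustration.

```python
import unicodedata

# Three spellings of "a with a ring above and a ring below".
spellings = [
    "a\u0325\u030a",  # a, ring below, ring above
    "a\u030a\u0325",  # a, ring above, ring below
    "\xe5\u0325",     # precomposed a-with-ring-above, then ring below
]

# Collect the distinct results of normalizing each spelling.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

print(ascii(nfc_forms))
print(ascii(nfd_forms))
```

Each set should end up with a single element, which is the point: whichever spelling you start from, a given normalization form produces one sequence.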
I had believed that NFC/NFD normalization would *always* provide a
canonical ordering for the combining characters, but apparently only
some are affected:
>>> print(ascii(unicodedata.normalize("NFC","q\u0303\u0301")))
'q\u0303\u0301'
>>> print(ascii(unicodedata.normalize("NFC","q\u0301\u0303")))
'q\u0301\u0303'
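A likely explanation is the canonical combining class: normalization reorders adjacent marks only when their classes differ, and marks of the same class keep their relative order. The sketch below just prints the class of each mark used in this post via `unicodedata.combining`; the `classes` dict is a name invented for the example.

```python
import unicodedata

# Marks used in the examples: ring below, ring above, tilde, acute.
marks = "\u0325\u030a\u0303\u0301"

# Map each mark to its canonical combining class.
classes = {mark: unicodedata.combining(mark) for mark in marks}

for mark, ccc in classes.items():
    print(ascii(mark), unicodedata.name(mark), ccc)
```

The ring below is in class 220 while the other three are all in class 230, which matches what the sessions show: the two rings get a fixed order, the tilde and the acute do not.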
(And NFK[CD] doesn't change this either.) But if you're really worried
about these kinds of equivalencies, you could write your own
"super-normalize" function which first NFKD normalizes, then sorts all
sequences of combining characters into codepoint order, and finally
NFKC or NFKD normalizes to canonicalize everything.
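A minimal sketch of that idea follows. The function name `super_normalize` and its `form` parameter are invented here, and "combining character" is taken to mean anything with a nonzero canonical combining class, which is an assumption. Note that this deliberately treats as equal some sequences that Unicode keeps distinct, so it is a matching aid rather than a standard normalization form.

```python
import unicodedata

def super_normalize(text, form="NFKC"):
    """Hypothetical helper: normalize, sort each run of combining marks
    into codepoint order, then normalize again into `form`."""
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    run = []  # current run of consecutive combining marks
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Nonzero combining class: part of the current run.
            run.append(ch)
        else:
            # A base character ends the run; flush it in codepoint order.
            out.extend(sorted(run))
            run = []
            out.append(ch)
    out.extend(sorted(run))  # flush any trailing run
    # The final pass restores canonical order between different classes
    # and composes (NFKC) or leaves things decomposed (NFKD).
    return unicodedata.normalize(form, "".join(out))

print(ascii(super_normalize("q\u0303\u0301")))
print(ascii(super_normalize("q\u0301\u0303")))
```

With this, both orderings of the tilde and acute on "q" come out as the same string, and the double-ring "a" still ends up in its usual normalized form.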
ChrisA