Python Unicode handling wins again -- mostly

Steven D'Aprano steve at
Tue Dec 3 06:06:26 CET 2013

On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:

> On 12/2/13 3:38 PM, Ethan Furman wrote:
>> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:
>>> Out of the nine tests, Python 3.3 passes six, with three tests being
>>> failures or dubious. If you believe that the native string type should
>>> operate on code-points, then you'll think that Python does the right
>>> thing.
>> I think Python is doing it correctly.  If I want to operate on
>> "clusters" I'll normalize the string first.
>> Thanks for this excellent post.
>> --
>> ~Ethan~
> This is where my knowledge about Unicode gets fuzzy.  Isn't it the case
> that some grapheme clusters (or whatever the right word is) can't be
> normalized down to a single code point?  Characters can accept many
> accents, for example.  In that case, you can't always normalize and use
> the existing string methods, but would need more specialized code.

That is correct.
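
A minimal sketch of the difference, using only the standard unicodedata
module. The two sample strings are my own picks for illustration, not
anything from the tests discussed earlier in the thread:

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT has a precomposed form, U+00E9,
# so NFC normalization collapses it to a single code point.
e_acute = unicodedata.normalize('NFC', 'e\u0301')

# 'q' + COMBINING TILDE has no precomposed form, so NFC has
# nothing to compose it into and leaves two code points.
q_tilde = unicodedata.normalize('NFC', 'q\u0303')

print(len(e_acute), len(q_tilde))  # prints: 1 2
```

Normalizing first is enough for the first string, but len(), slicing and
friends will still see two "characters" in the second.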

If Unicode had a distinct code point for every possible combination of 
base-character plus an arbitrary number of diacritics or accents, the 
0x110000 code points (U+0000 through U+10FFFF) wouldn't be anywhere near 
enough.

I see over 300 diacritics used just in the first 5000 code points. Let's 
pretend that's only 100, and that you can use up to a maximum of 5 at a 
time. That gives 79375496 combinations per base character, far more than 
the total number of Unicode code points.

If anyone wishes to check my logic:

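# count distinct combining chars
import unicodedata
s = ''.join(chr(i) for i in range(33, 5000))
s = unicodedata.normalize('NFD', s)
# a non-zero combining class marks a combining character;
# use a set so that each one is counted only once
t = set(c for c in s if unicodedata.combining(c))
len(t)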

# calculate the number of combinations
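def comb(r, n):
    """Combinations nCr: the number of ways to choose r items from n."""
    p = 1
    # p = n!/r!
    for i in range(r+1, n+1):
        p *= i
    # divide by (n-r)!; each step divides exactly, so integer
    # division avoids the float that / returns in Python 3
    for i in range(1, n-r+1):
        p //= i
    return p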

sum(comb(i, 100) for i in range(6))
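
As a cross-check on the arithmetic, here is a sketch using math.comb. Note
the assumption: math.comb only exists in Python 3.8 and later, so it is
not available in the 3.3 under discussion. The names NUM_MARKS and
MAX_MARKS are mine, standing in for the "pretend 100" and "up to 5"
figures above.

```python
import math

NUM_MARKS = 100   # pretend number of distinct combining marks
MAX_MARKS = 5     # at most this many marks per base character

# Choose 0, 1, ..., MAX_MARKS marks out of NUM_MARKS, ignoring order.
combos = sum(math.comb(NUM_MARKS, k) for k in range(MAX_MARKS + 1))

# Code points run from U+0000 to U+10FFFF inclusive.
code_points = 0x10FFFF + 1

print(combos, code_points, combos // code_points)
# prints: 79375496 1114112 71
```

So even with these deliberately low numbers, a single base character
would need about 71 times the entire code space.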

I'm not suggesting that all of those accents are necessarily in use in 
the real world, but there are languages which construct arbitrary 
combinations of accents. (Or so I have been led to believe.) 

