<p dir="ltr">How would a grapheme library work? Basic cluster combination, or would implementing other algorithms (line break, normalizing to a "canonical" form) be necessary?</p>
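<p dir="ltr">To make "basic cluster combination" concrete, here is a minimal sketch, assuming that a cluster is just a base character plus whatever combining marks follow it. The name naive_clusters is made up for illustration. Real cluster boundaries are defined by Unicode Standard Annex #29 and cover cases this ignores (Hangul jamo, emoji sequences, and so on).</p>

```python
import unicodedata

def naive_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification: only characters with a non-zero canonical combining
    class are attached to the preceding character. This is not the full
    UAX #29 grapheme cluster algorithm.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Combining mark: attach it to the cluster being built.
            clusters[-1] += ch
        else:
            # Base character (or a stray leading mark): start a new cluster.
            clusters.append(ch)
    return clusters

decomposed = unicodedata.normalize('NFD', 'caf\u00e9')
print(len(decomposed))                   # 5 code points
print(len(naive_clusters(decomposed)))   # 4 clusters
```

<p dir="ltr">Even this toy version shows the gap between len() on code points and what a reader would count as characters.</p>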

<p dir="ltr">How do people use grapheme clusters in non-rendering situations? Or perhaps here's a better question: does anyone know any speakers of non-Latin-script languages (Japanese and Arabic come to mind) who use Python to process text in their own language? They could perhaps tell us what bugs them most about Python's current API and which standard libraries need work.</p>


<div class="gmail_quote">On Dec 2, 2013 10:10 PM, "Steven D'Aprano" &lt;<a href="mailto:steve@pearwood.info">steve@pearwood.info</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On Mon, 02 Dec 2013 16:14:13 -0500, Ned Batchelder wrote:<br>

<br>

> On 12/2/13 3:38 PM, Ethan Furman wrote:<br>

>> On 11/29/2013 04:44 PM, Steven D'Aprano wrote:<br>

>>><br>

>>> Out of the nine tests, Python 3.3 passes six, with three tests being<br>

>>> failures or dubious. If you believe that the native string type should<br>

>>> operate on code-points, then you'll think that Python does the right<br>

>>> thing.<br>

>><br>

>> I think Python is doing it correctly.  If I want to operate on<br>

>> "clusters" I'll normalize the string first.<br>

>><br>

>> Thanks for this excellent post.<br>

>><br>

>> --<br>

>> ~Ethan~<br>

><br>

> This is where my knowledge about Unicode gets fuzzy.  Isn't it the case<br>

> that some grapheme clusters (or whatever the right word is) can't be<br>

> normalized down to a single code point?  Characters can accept many<br>

> accents, for example.  In that case, you can't always normalize and use<br>

> the existing string methods, but would need more specialized code.<br>

<br>

That is correct.<br>
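<br>

A minimal sketch of the point, assuming nothing beyond the standard unicodedata module: NFC composes "e" plus a combining acute accent into one code point, but "x" plus the same accent has no precomposed form, so it stays as two.<br>

```python
import unicodedata

composable = 'e\u0301'      # e + COMBINING ACUTE ACCENT
not_composable = 'x\u0301'  # x + COMBINING ACUTE ACCENT

nfc_e = unicodedata.normalize('NFC', composable)
nfc_x = unicodedata.normalize('NFC', not_composable)

print(len(nfc_e))  # 1: composed to U+00E9
print(len(nfc_x))  # 2: no single code point exists for this pair
```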

<br>

If Unicode had a distinct code point for every possible combination of<br>

base-character plus an arbitrary number of diacritics or accents, the<br>

0x110000 (1,114,112) available code points wouldn't be anywhere near enough.<br>

<br>

I see over 300 diacritics used just in the first 5000 code points. Let's<br>

pretend that's only 100, and that you can use up to a maximum of 5 at a<br>

time. That gives 79375496 combinations per base character, far more<br>

than the total number of Unicode code points.<br>

<br>

If anyone wishes to check my logic:<br>

<br>

# count distinct combining chars<br>

import unicodedata<br>

s = ''.join(chr(i) for i in range(33, 5000))<br>

s = unicodedata.normalize('NFD', s)<br>

t = [c for c in s if unicodedata.combining(c)]<br>

print(len(set(t)))<br>

<br>

# calculate the number of combinations<br>

def comb(r, n):<br>

    """Combinations nCr"""<br>

    p = 1<br>

    for i in range(r+1, n+1):<br>

        p *= i<br>

    for i in range(1, n-r+1):<br>

        p //= i  # integer division is exact at each step and keeps the result an int<br>

    return p<br>

<br>

print(sum(comb(i, 100) for i in range(6)))<br>

<br>

<br>

I'm not suggesting that all of those accents are necessarily in use in<br>

the real world, but there are languages which construct arbitrary<br>

combinations of accents. (Or so I have been led to believe.)<br>

<br>

<br>

--<br>

Steven<br>


</blockquote></div>