grapheme cluster library

Thomas Jollans tjol at tjol.eu
Mon Oct 23 17:50:38 EDT 2017


On 23/10/17 16:25, Rustom Mody wrote:
> On Monday, October 23, 2017 at 1:15:35 PM UTC+5:30, Steve D'Aprano wrote:
>>
>> and more. Many linguists also include digraphs (pairs of letters) like the
>> English "th", "sh", "qu", or "gh" as graphemes.
>>
>>
>> https://www.thoughtco.com/what-is-a-grapheme-1690916
>>
>> https://en.wikipedia.org/wiki/Grapheme
> 
> Um… Ok So I am using the wrong word? Your first link says:
> | For example, the word 'ghost' contains five letters and four graphemes 
> | ('gh,' 'o,' 's,' and 't')
> 
> Whereas new regex findall does:
> 
> >>> findall(r'\X', "ghost")
> ['g', 'h', 'o', 's', 't']
> >>> findall(r'\X', "church")
> ['c', 'h', 'u', 'r', 'c', 'h']
> 

The definition of a "grapheme" in the Unicode standard does not
necessarily line up with the linguistic definition of a grapheme for
any particular language.
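
To make the difference concrete, here is a rough sketch using only the
standard library. It is not what \X does internally: the real rules are
in UAX #29, and this hypothetical rough_clusters() helper only glues
combining marks onto the preceding character. It should still show that
Unicode-style clusters are about combining sequences, not digraphs:

```python
import unicodedata


def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Only a rough stand-in for Unicode extended grapheme clusters: it
    ignores Hangul syllables, emoji sequences and much else.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "gh" is never merged: both letters are ordinary base characters.
print(rough_clusters("ghost"))
# "e" followed by U+0301 (combining acute accent) becomes one cluster.
print(len(rough_clusters("e\u0301te\u0301")))
```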

Even if we assumed that there was a universally agreed definition of the
term for every written language (for English there certainly isn't),
you'd need dictionaries and information on which language you're dealing
with to pull this trick off.

As an example to illustrate why you'd need dictionaries:

In Dutch, "ij" (the "long IJ", as opposed to the "Greek Y") is generally
considered a single letter, or at the very least a single grapheme.
There is a Unicode code point for it (ĳ, U+0133), but it isn't widely
used.
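
For the curious, the dedicated code point is U+0133. A small sketch with
the standard unicodedata module shows how it relates to the plain
two-letter spelling (the variable names are just for illustration):

```python
import unicodedata

ligature = "\u0133"  # the single code point for the Dutch ij

print(unicodedata.name(ligature))

# The usual spelling uses two code points, the ligature spelling one.
plain = "vrij"
with_ligature = "vr" + ligature
print(len(plain), len(with_ligature))

# Compatibility normalization (NFKC) folds the ligature back to "i" + "j".
print(unicodedata.normalize("NFKC", with_ligature) == plain)
```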

So "vrij" (free) has three graphemes (v r ij) and three or four letters.
However, in "bijectie" (bijection), "i" and "j" are two separate
graphemes, so this word has eight letters and seven or eight graphemes.
("ie" may or may not be one single grapheme...)
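
A minimal sketch of what "needing a dictionary" means in practice. The
exception set and the dutch_graphemes() function below are made up for
illustration; a real implementation would need a proper word list, and
this one simply assumes that "ie" counts as two graphemes:

```python
# Hypothetical, tiny exception list: words in which "i" and "j" happen
# to be adjacent but belong to different morphemes.
IJ_IS_TWO_LETTERS = {"bijectie"}


def dutch_graphemes(word):
    """Split a Dutch word, treating "ij" as one grapheme unless the
    word is listed as an exception."""
    split_ij = word.lower() in IJ_IS_TWO_LETTERS
    graphemes = []
    i = 0
    while i < len(word):
        if not split_ij and word[i:i + 2].lower() == "ij":
            graphemes.append(word[i:i + 2])
            i += 2
        else:
            graphemes.append(word[i])
            i += 1
    return graphemes


print(dutch_graphemes("vrij"))      # ['v', 'r', 'ij']
print(dutch_graphemes("bijectie"))  # eight single-letter items
```

Without the exception list, the same function would wrongly give
"bijectie" a single "ij" grapheme, which is the whole problem.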


-- Thomas


PS: This may not be obvious to you at first unless you're Dutch.


