grapheme cluster library
Rustom Mody
rustompmody at gmail.com
Mon Oct 23 10:25:29 EDT 2017
On Monday, October 23, 2017 at 1:15:35 PM UTC+5:30, Steve D'Aprano wrote:
> On Mon, 23 Oct 2017 05:47 pm, Rustom Mody wrote:
>
> > On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro
> > wrote:
> [...]
> >> Bear in mind that the logical representation of the text is as code points,
> >> graphemes would have more to do with rendering.
> >
> > Heh! Speak of Euro/Anglo-centrism!
>
> I think that Lawrence may be thinking of glyphs. Glyphs are the display forms
> that are rendered. Graphemes are the smallest unit of written language.
>
>
> > In a sane world graphemes would be called letters
>
> Graphemes *aren't* letters.
>
> For starters, not all written languages have an alphabet. No alphabet, no
> letters. Even in languages with an alphabet, not all graphemes are letters.
>
> Graphemes include:
>
> - logograms (symbols which represent a morpheme, an entire word, or
> a phrase), e.g. Chinese characters, ampersand &, the ™ trademark
> or ® registered trademark symbols;
>
> - syllabic characters such as Japanese kana or Cherokee;
>
> - letters of alphabets;
>
> - letters with added diacritics;
>
> - punctuation marks;
>
> - mathematical symbols;
>
> - typographical symbols;
>
> - word separators;
>
> and more. Many linguists also include digraphs (pairs of letters) like the
> English "th", "sh", "qu", or "gh" as graphemes.
>
>
> https://www.thoughtco.com/what-is-a-grapheme-1690916
>
> https://en.wikipedia.org/wiki/Grapheme
Um… OK, so I am using the wrong word? Your first link says:
| For example, the word 'ghost' contains five letters and four graphemes
| ('gh,' 'o,' 's,' and 't')
Whereas findall from the new regex module gives:
>>> from regex import findall
>>> findall(r'\X', "ghost")
['g', 'h', 'o', 's', 't']
>>> findall(r'\X', "church")
['c', 'h', 'u', 'r', 'c', 'h']
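As far as I can tell, \X matches what Unicode calls an extended grapheme cluster (UAX #29), which is a different notion from the linguist's grapheme in the quoted link. So it never merges 'gh', but it does keep a base character together with its combining marks. A minimal stdlib-only sketch of that idea follows. The function name rough_clusters is made up for illustration, and the logic is only a crude approximation of the real segmentation rules, not what the regex module actually does.

```python
import unicodedata


def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    Only an approximation of Unicode extended grapheme clusters: it ignores
    Hangul jamo, ZWJ emoji sequences, regional indicators, and marks whose
    canonical combining class is 0.
    """
    clusters = []
    for ch in text:
        # A nonzero combining class means the character attaches to the
        # preceding base character, so append it to the current cluster.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Plain ASCII letters: one cluster per code point, digraphs are not merged.
print(rough_clusters("ghost"))

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one cluster.
print(len("cafe\u0301"), len(rough_clusters("cafe\u0301")))
```

With this sketch "ghost" still comes out as five pieces, while "cafe" plus a combining acute accent is five code points but four clusters. That seems to be the sense in which \X is "grapheme"-aware.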