grapheme cluster library
Steve D'Aprano
steve+python at pearwood.info
Mon Oct 23 03:45:17 EDT 2017
On Mon, 23 Oct 2017 05:47 pm, Rustom Mody wrote:
> On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro
> wrote:
[...]
>> Bear in mind that the logical representation of the text is as code points,
>> graphemes would have more to do with rendering.
>
> Heh! Speak of Euro/Anglo-centrism!
I think that Lawrence may be thinking of glyphs. Glyphs are the display forms
that are rendered. Graphemes are the smallest units of written language.
> In a sane world graphemes would be called letters
Graphemes *aren't* letters.
For starters, not all written languages have an alphabet. No alphabet, no
letters. Even in languages with an alphabet, not all graphemes are letters.
Graphemes include:
- logograms (symbols which represent a morpheme, an entire word, or
  a phrase), e.g. Chinese characters, ampersand &, the ™ trademark
  or ® registered trademark symbols;
- syllabic characters such as Japanese kana or Cherokee;
- letters of alphabets;
- letters with added diacritics;
- punctuation marks;
- mathematical symbols;
- typographical symbols;
- word separators;
and more. Many linguists also include digraphs (pairs of letters) like the
English "th", "sh", "qu", or "gh" as graphemes.
https://www.thoughtco.com/what-is-a-grapheme-1690916
https://en.wikipedia.org/wiki/Grapheme
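As a rough illustration from the Python side (a sketch only: Unicode general
categories are a property of code points, not the linguists' notion of a
grapheme, and the sample characters are my own choice), the standard
unicodedata module shows that plenty of things we write are not classified as
letters:

```python
import unicodedata

# Sample characters chosen for illustration: a Latin letter, a kana
# syllable, a logogram-like symbol, a trademark sign, a maths symbol
# and a word separator.
samples = ["a", "あ", "&", "™", "+", " "]

# Map each character to its two-letter Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat, unicodedata.name(ch))

# Only categories starting with "L" are letters as far as Unicode is concerned.
letters = [ch for ch, cat in categories.items() if cat.startswith("L")]
print(letters)
```

Only the first two samples land in a letter category; the rest are
punctuation, symbols or separators.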
> And unicode codepoints would be called something else — letterlets??
> To be fair to the Unicode consortium, they strive hard to call them
> codepoints But in an anglo-centric world, the conflation of codepoint to
> letter is inevitable I guess. To hear how a non Roman-centric view of the
> world would sound: A 'w' is a poorly double-struck 'u'
> A 't' is a crossed 'l'
> Reasonable?
No, T is not a crossed L -- they are unrelated letters and the visual
similarity is a coincidence. T is no more a crossed L than E is an F with an
extra line.
But you are more right than you knew regarding W: it *literally was* a
doubled-up V (sometimes written U) once upon a time.
For a long time W did not appear in the Latin alphabet, even after people used
it in written text. It was considered a digraph VV, then a ligature, and
finally, only gradually, a proper letter. As late as the 16th century the
German grammarian Valentin Ickelshamer complained that hardly anyone,
including school masters, knew what to do with W or what it was called.
https://en.wikipedia.org/wiki/W#History
> The lead of https://en.wikipedia.org/wiki/%C3%9C has
>
> | Ü, or ü, is a character…classified as a separate letter in several
> | extended Latin alphabets
> | (including Azeri, Estonian, Hungarian and Turkish), but as the letter U
> | with an umlaut/diaeresis in others such as Catalan, French, Galician,
> | German, Occitan and Spanish.
Indeed: sometimes the same grapheme is considered a letter in one language and
a letter-plus-diacritic in another.
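Unicode carries both views of Ü at the code point level. A minimal sketch
using only the standard unicodedata module (the variable names are just for
illustration): the same grapheme can be stored as one precomposed code point
or as a base letter followed by a combining diacritic, and normalization
converts between the two.

```python
import unicodedata

# Ü stored as a single precomposed code point (U+00DC).
precomposed = "\u00dc"

# NFD normalization splits it into U followed by a combining diaeresis.
decomposed = unicodedata.normalize("NFD", precomposed)

print(len(precomposed), len(decomposed))
print([unicodedata.name(c) for c in decomposed])

# NFC normalization recombines the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both strings display as the same grapheme, yet len() reports 1 for the first
and 2 for the second, which is why counting code points is not the same as
counting graphemes.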
--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.