Grapheme clusters, a.k.a.real characters
Rustom Mody
rustompmody at gmail.com
Sun Jul 16 00:33:28 EDT 2017
On Sunday, July 16, 2017 at 4:09:16 AM UTC+5:30, Mikhail V wrote:
> On Sat, 15 Jul 2017 05:50 pm, Marko Rauhamaa wrote:
> > Random access to code points is as uninteresting as random access to
> > UTF-8 bytes.
> > I might want random access to the "Grapheme clusters, a.k.a.real
> > characters".
>
> What _real_ characters are you referring to?
> If your data has "á" (U00E1), then it is one real character,
> if you have "a" (U0061) and "ˊ" (U02CA) then it is _two_
> real characters. So in both cases you have access to code points =
> real characters.
Right now in an adjacent mailing list (debian) I see someone signed off with a
grüß
I guess the third character is a u with some ‘dirt’
What's the fourth?
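
For anyone who wants to poke at this, here is a minimal sketch using only the standard unicodedata module. It assumes the signature was typed in precomposed (NFC) form, which I cannot know from the mail itself. The variable names are mine.

```python
import unicodedata

# Written with escapes so the precomposed (NFC) form is unambiguous.
# Assumption: this is how the signature was encoded.
word = "gr\u00fc\u00df"

# One line per code point: number and official Unicode name.
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Canonical decomposition splits the third character into u + combining
# diaeresis, but leaves the fourth alone: it has no decomposition.
decomposed = unicodedata.normalize("NFD", word)
print(len(word), len(decomposed))   # 4 5

# Uppercasing changes the code point count as well.
print(word.upper())
```

So the third character can be pulled apart into "u plus dirt", while the fourth cannot be pulled apart at all.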
>
> For metaphysical discussion - in _my_ definition there
s/metaphysical/linguistic
> is no such "real" character as "á", since it is the "a" glyph with some dirt,
> so according to my definition, it should be two separate characters,
> both semantically and technically seen.
>
> And, in my definition, the whole Unicode is a huge junkyard, to start with.
>
> But opinions may vary, and in case you prefer or forced to write "á",
> then it can be impractical to store it as two characters, regardless of
> encoding.
Heck even in the English that I learnt in school we had
ægis, homœopath etc
And just now looking up:
https://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
I see economics is œconomics!!
Seriously, the word "ligature", like the word "grapheme", is misleading.
It's not a graphical or typographic notion; it's an atom of the language's lexicon.
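
A rough way to see how Unicode itself treats these, again with only the standard unicodedata module. The comparison with the "fi" presentation form (U+FB01) is my own addition for contrast and is not something from the thread.

```python
import unicodedata

# ae and oe as used in "aegis" / "homoeopath", plus the purely
# typographic fi ligature (U+FB01) for contrast.
samples = ["\u00e6", "\u0153", "\ufb01"]

info = {}
for ch in samples:
    info[ch] = {
        "name": unicodedata.name(ch),
        # Empty string means the character has no decomposition mapping.
        "decomposition": unicodedata.decomposition(ch),
        # Compatibility normalization breaks up typographic ligatures only.
        "nfkd": unicodedata.normalize("NFKD", ch),
    }
    print(f"U+{ord(ch):04X}", info[ch])
```

The first two come back unchanged under NFKD, whereas the fi ligature falls apart into "f" and "i". That is at least consistent with treating æ and œ as units of the spelling rather than as typography.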
No Hindi speaker seeing
क + ई = की
calls the last anything but a letter
And the vowel sign ी is never a first-class vowel
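
In Python terms, a minimal sketch with only the standard unicodedata module. The helper rough_clusters is a hypothetical name of mine and only a crude approximation (attach every combining mark to the preceding character); it is not the real grapheme cluster algorithm from the Unicode segmentation rules.

```python
import unicodedata

ka = "\u0915"          # the consonant letter
vowel_ii = "\u0908"    # the independent vowel letter
sign_ii = "\u0940"     # the dependent vowel sign
syllable = ka + sign_ii

for ch in (ka, vowel_ii, sign_ii):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# What a reader calls one letter is two code points.
print(len(syllable))   # 2


def rough_clusters(text):
    """Crude approximation: glue combining marks (category M*) to the
    preceding character. Not a full grapheme cluster implementation."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(rough_clusters(syllable))
```

The vowel letter is classed as a letter (Lo) and the vowel sign as a combining mark (Mc), so indexing by code point lands in the middle of what a Hindi reader sees as a single letter.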