grapheme cluster library (Posting On Python-List Prohibited)
Rustom Mody
rustompmody at gmail.com
Mon Oct 23 02:47:02 EDT 2017
On Monday, October 23, 2017 at 8:06:03 AM UTC+5:30, Lawrence D’Oliveiro wrote:
> On Saturday, October 21, 2017 at 5:11:13 PM UTC+13, Rustom Mody wrote:
> > Is there a recommended library for manipulating grapheme clusters?
>
> Is this <http://anoopkunchukuttan.github.io/indic_nlp_library/> any good?
Thanks, looks promising.
Dunno how much it lives up to its claims, though.
For now the one-liner using findall from the regex library has sufficed:
findall(r'\X', «text»)
[Thanks MRAB for the library]
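A minimal sketch of that one-liner in context, to show the difference between counting code points and counting grapheme clusters. The helper names graphemes and rough_clusters are made up for illustration. regex is the third-party module (not the stdlib re); if it is not installed, the sketch falls back to a crude stand-in that only glues combining marks onto the preceding character, which is not the full Unicode segmentation algorithm.

```python
import unicodedata

try:
    import regex  # third-party module by MRAB: pip install regex
except ImportError:
    regex = None


def rough_clusters(text):
    """Crude stdlib stand-in (an assumption for illustration only):
    attach each combining mark to the character before it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def graphemes(text):
    """Split text into grapheme clusters, preferring regex's \\X."""
    if regex is not None:
        return regex.findall(r'\X', text)
    return rough_clusters(text)


word = 'u\u0308ber'            # 'u' + COMBINING DIAERESIS, then 'ber'
print(len(word))               # 5 code points
print(len(graphemes(word)))    # 4 clusters: u+diaeresis, b, e, r
```

The fallback is only there so the sketch runs without regex; for Indic scripts and emoji sequences it will give wrong answers, and \X is the thing to use.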
> Bear in mind that the logical representation of the text is as code points, graphemes would have more to do with rendering.
Heh! Speak of Euro/Anglo-centrism!
In a sane world graphemes would be called letters
And Unicode code points would be called something else: letterlets??
To be fair to the Unicode Consortium, they strive hard to call them code points
But in an Anglo-centric world, the conflation of code point with letter is inevitable I guess.
To hear how a non Roman-centric view of the world would sound:
A 'w' is a poorly double-struck 'u'
A 't' is a crossed 'l'
Reasonable?
The lead of https://en.wikipedia.org/wiki/%C3%9C has
| Ü, or ü, is a character…classified as a separate letter in several extended
| Latin alphabets (including Azeri, Estonian, Hungarian and Turkish), but as
| the letter U with an umlaut/diaeresis in others such as Catalan, French,
| Galician, German, Occitan and Spanish.
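The same split shows up at the code point level: Unicode can spell Ü either as one precomposed code point or as U followed by a combining diaeresis. A small sketch using the stdlib unicodedata module (the variable names are just for illustration):

```python
import unicodedata

composed = '\u00dc'                                   # Ü as a single code point
decomposed = unicodedata.normalize('NFD', composed)   # U + combining diaeresis

print(len(composed), len(decomposed))                 # 1 2
print([unicodedata.name(c) for c in decomposed])
# ['LATIN CAPITAL LETTER U', 'COMBINING DIAERESIS']

# The two spellings compare unequal as strings until normalized.
print(composed == decomposed)                                 # False
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
```

Either way a reader sees one letter, which is the point about code points not being letters.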