grapheme cluster library

Rustom Mody rustompmody at gmail.com
Sat Oct 21 06:08:38 EDT 2017


On Saturday, October 21, 2017 at 11:51:57 AM UTC+5:30, Chris Angelico wrote:
> On Sat, Oct 21, 2017 at 3:25 PM, Stefan Ram wrote:
> > Rustom Mody  writes:
> >>Is there a recommended library for manipulating grapheme clusters?
> >
> >   The Python Library has a module "unicodedata", with functions like:
> >
> > |unicodedata.normalize( form, unistr )
> > |
> > |Returns the normal form »form« for the Unicode string »unistr«.
> > |Valid values for »form« are »NFC«, »NFKC«, »NFD«, and »NFKD«.
> >
> >   . I don't know whether the transformation you are looking for
> >   is one of those.
> 
> No, that's at a lower level than grapheme clusters.
> 
> Rustom, have you looked on PyPI? There are a couple of hits, including
> one simply called "grapheme".

There is this one line solution using regex (or 2 char solution!)
Not perfect but a good start

>>> from regex import findall
>>> veda="""ॐ पूर्णमदः पूर्णमिदं पूर्णात्पुर्णमुदच्यते
पूर्णस्य पूर्णमादाय पूर्णमेवावशिष्यते ॥
ॐ शान्तिः शान्तिः शान्तिः ॥"""

>>> findall(r'\X', veda)
['ॐ', ' ', 'पू', 'र्', 'ण', 'म', 'दः', ' ', 'पू', 'र्', 'ण', 'मि', 'दं', ' ', 'पू', 'र्', 'णा', 'त्', 'पु', 'र्', 'ण', 'मु', 'द', 'च्', 'य', 'ते', '\n', 'पू', 'र्', 'ण', 'स्', 'य', ' ', 'पू', 'र्', 'ण', 'मा', 'दा', 'य', ' ', 'पू', 'र्', 'ण', 'मे', 'वा', 'व', 'शि', 'ष्', 'य', 'ते', ' ', '॥', '\n', 'ॐ', ' ', 'शा', 'न्', 'तिः', ' ', 'शा', 'न्', 'तिः', ' ', 'शा', 'न्', 'तिः', ' ', '॥']
>>> 

Compare

>>> [x for x in veda]
['ॐ', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'द', 'ः', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'ि', 'द', 'ं', ' ', 'प', 'ू', 'र', '्', 'ण', 'ा', 'त', '्', 'प', 'ु', 'र', '्', 'ण', 'म', 'ु', 'द', 'च', '्', 'य', 'त', 'े', '\n', 'प', 'ू', 'र', '्', 'ण', 'स', '्', 'य', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'ा', 'द', 'ा', 'य', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'े', 'व', 'ा', 'व', 'श', 'ि', 'ष', '्', 'य', 'त', 'े', ' ', '॥', '\n', 'ॐ', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', '॥']

What is not working are the vowel-less consonant-joins: 
ie ... 'र्', 'ण' ... 
[3,4 element of the findall]
should be one 'र्ण'

But its good enough for me for now I think

PS Stefan I dont see your responses unless someone quotes them. Thanks anyway for the inputs


More information about the Python-list mailing list