grapheme cluster library
Rustom Mody
rustompmody at gmail.com
Sat Oct 21 06:08:38 EDT 2017
On Saturday, October 21, 2017 at 11:51:57 AM UTC+5:30, Chris Angelico wrote:
> On Sat, Oct 21, 2017 at 3:25 PM, Stefan Ram wrote:
> > Rustom Mody writes:
> >>Is there a recommended library for manipulating grapheme clusters?
> >
> > The Python Library has a module "unicodedata", with functions like:
> >
> > |unicodedata.normalize( form, unistr )
> > |
> > |Returns the normal form »form« for the Unicode string »unistr«.
> > |Valid values for »form« are »NFC«, »NFKC«, »NFD«, and »NFKD«.
> >
> > . I don't know whether the transformation you are looking for
> > is one of those.
>
> No, that's at a lower level than grapheme clusters.
>
> Rustom, have you looked on PyPI? There are a couple of hits, including
> one simply called "grapheme".
There is this one line solution using regex (or 2 char solution!)
Not perfect but a good start
>>> from regex import findall
>>> veda="""ॐ पूर्णमदः पूर्णमिदं पूर्णात्पुर्णमुदच्यते
पूर्णस्य पूर्णमादाय पूर्णमेवावशिष्यते ॥
ॐ शान्तिः शान्तिः शान्तिः ॥"""
>>> findall(r'\X', veda)
['ॐ', ' ', 'पू', 'र्', 'ण', 'म', 'दः', ' ', 'पू', 'र्', 'ण', 'मि', 'दं', ' ', 'पू', 'र्', 'णा', 'त्', 'पु', 'र्', 'ण', 'मु', 'द', 'च्', 'य', 'ते', '\n', 'पू', 'र्', 'ण', 'स्', 'य', ' ', 'पू', 'र्', 'ण', 'मा', 'दा', 'य', ' ', 'पू', 'र्', 'ण', 'मे', 'वा', 'व', 'शि', 'ष्', 'य', 'ते', ' ', '॥', '\n', 'ॐ', ' ', 'शा', 'न्', 'तिः', ' ', 'शा', 'न्', 'तिः', ' ', 'शा', 'न्', 'तिः', ' ', '॥']
>>>
Compare
>>> [x for x in veda]
['ॐ', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'द', 'ः', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'ि', 'द', 'ं', ' ', 'प', 'ू', 'र', '्', 'ण', 'ा', 'त', '्', 'प', 'ु', 'र', '्', 'ण', 'म', 'ु', 'द', 'च', '्', 'य', 'त', 'े', '\n', 'प', 'ू', 'र', '्', 'ण', 'स', '्', 'य', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'ा', 'द', 'ा', 'य', ' ', 'प', 'ू', 'र', '्', 'ण', 'म', 'े', 'व', 'ा', 'व', 'श', 'ि', 'ष', '्', 'य', 'त', 'े', ' ', '॥', '\n', 'ॐ', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', 'श', 'ा', 'न', '्', 'त', 'ि', 'ः', ' ', '॥']
What is not working are the vowel-less consonant-joins:
ie ... 'र्', 'ण' ...
[3,4 element of the findall]
should be one 'र्ण'
But its good enough for me for now I think
PS Stefan I dont see your responses unless someone quotes them. Thanks anyway for the inputs
More information about the Python-list
mailing list