[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)
David Kendal
me at dpk.io
Sun Jul 7 12:29:03 CEST 2013
Hi,
Python provides a way to iterate over the characters of a string by using the string as an iterable. But there's no way to iterate over Unicode graphemes (clusters of characters, each consisting of a base character plus any number of combining marks and other modifiers -- or what the human eye would consider to be one "character").
I think this ought to be provided either in the unicodedata library (as unicodedata.itergraphemes(string)), which exposes the character database information needed to make this work, or as a method on the built-in str type (str.itergraphemes() or str.graphemes()).
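As a small illustration of the gap (a sketch only; the sample string is an arbitrary choice), iterating over a string that contains a combining mark yields the mark as its own element rather than attached to its base character:

```python
import unicodedata

# "cafe" followed by U+0301 COMBINING ACUTE ACCENT, which renders as "café".
text = "cafe\u0301"

# Iterating a str yields one element per code point, not per grapheme.
code_points = list(text)

# Look up the official name of each code point to see what was split apart.
names = [unicodedata.name(c) for c in code_points]

for c, name in zip(code_points, names):
    print(repr(c), name)
```

The reader sees four characters, but the iteration produces five elements, the last of which is the accent on its own.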
Below is my own implementation of this as a generator, as an example and for reference.
---
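import unicodedata

def itergraphemes(string):
    # Treat any character in a Mark category (Mn, Mc, Me) as a modifier.
    def ismodifier(char):
        return unicodedata.category(char)[0] == 'M'

    start = 0
    for end, char in enumerate(string):
        # A non-modifier starts a new grapheme; emit the one before it.
        if not ismodifier(char) and start != end:
            yield string[start:end]
            start = end
    # Emit the final grapheme (nothing is emitted for an empty string).
    if start < len(string):
        yield string[start:]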
---
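A usage sketch for the generator above, with the function repeated so the snippet runs on its own. The sample string is an arbitrary choice, and the check at the end that keeps an empty string from yielding an empty grapheme is an assumption about the desired behaviour:

```python
import unicodedata

def itergraphemes(string):
    # Same rule as the generator above: a character whose general category
    # starts with 'M' (a mark) is attached to the preceding base character.
    def ismodifier(char):
        return unicodedata.category(char)[0] == 'M'

    start = 0
    for end, char in enumerate(string):
        if not ismodifier(char) and start != end:
            yield string[start:end]
            start = end
    # Assumption: an empty input should produce no graphemes at all.
    if start < len(string):
        yield string[start:]

# "cafe" + U+0301 COMBINING ACUTE ACCENT, a space, "n" + U+0303 COMBINING TILDE
sample = "cafe\u0301 n\u0303"
clusters = list(itergraphemes(sample))

for cluster in clusters:
    print(len(cluster), repr(cluster))
```

The eight code points in the sample come out as six clusters, with each accent kept together with its base letter. This groups by general category only; the extended grapheme cluster rules in Unicode Standard Annex #29 cover further cases that this approach does not attempt to handle.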
Thanks,
dpk