[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)

Sun Jul 7 12:29:03 CEST 2013

Hi,

Python provides a way to iterate over the characters of a string by using the string itself as an iterable. But there is no way to iterate over Unicode graphemes (clusters of characters consisting of a base character plus any number of combining marks and other modifiers, or what the human eye would consider to be one "character").
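
To make the gap concrete, here is a minimal sketch. The sample string is only an illustration: an "é" written as a base letter followed by a combining acute accent. Plain iteration splits it into two items even though a reader sees one character.

```python
import unicodedata

# "é" spelled as a base letter plus COMBINING ACUTE ACCENT (U+0301).
s = "e\u0301"

# Iterating the string yields code points, so the accent comes out on its own.
code_points = list(s)

# The general category tells the two apart: 'Ll' is a lowercase letter,
# 'Mn' is a non-spacing mark.
categories = [unicodedata.category(c) for c in s]

print(len(code_points), categories)
```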

I think this ought to be provided either in the unicodedata library as unicodedata.itergraphemes(string), since that library already exposes the character database information needed to make this work, or as a method on the built-in str type, as str.itergraphemes() or str.graphemes().

Below is my own implementation of this as a generator, as an example and for reference.

---
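import unicodedata

def itergraphemes(string):
    """Yield each base character together with its trailing combining marks."""
    def ismodifier(char):
        # The general categories Mn, Mc and Me (marks) all begin with 'M'.
        return unicodedata.category(char).startswith('M')

    start = 0
    for end, char in enumerate(string):
        # A non-mark character starts a new cluster, so emit the previous one.
        if not ismodifier(char) and start != end:
            yield string[start:end]
            start = end
    # Emit the final cluster; an empty string yields nothing.
    if start < len(string):
        yield string[start:]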
---
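
As a rough usage sketch, here is how the generator behaves on a short sample word. The generator is repeated in condensed form so the snippet runs on its own; the sample string and the choice to yield nothing for an empty string are assumptions made for illustration.

```python
import unicodedata

def itergraphemes(string):
    # Condensed variant of the generator above, repeated for self-containment.
    start = 0
    for end, char in enumerate(string):
        is_mark = unicodedata.category(char).startswith('M')
        if not is_mark and start != end:
            yield string[start:end]
            start = end
    # Assumption: an empty string should produce no clusters at all.
    if start < len(string):
        yield string[start:]

# "cafés" with the accent stored as a separate combining mark (U+0301).
word = "cafe\u0301s"
clusters = list(itergraphemes(word))

# Six code points, but five clusters: the accent stays attached to its "e".
print(len(word), len(clusters))
```

This category-based rule is only an approximation of the extended grapheme clusters defined in Unicode Standard Annex #29; for example, it does not keep a CR LF pair or a sequence of Hangul jamo together.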

Thanks,

dpk