[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)

Sun Jul 7 16:31:38 CEST 2013

On 07/07/2013 11:29, David Kendal wrote:
> Hi,
>
> Python provides a way to iterate over the characters of a string by using the string as an iterable. But there's no way to iterate over Unicode graphemes (clusters of characters consisting of a base character plus a number of combining marks and other modifiers -- or what the human eye would consider to be one "character").
>
> I think this ought to be provided either in the unicodedata library (as unicodedata.itergraphemes(string)), which exposes the character database information needed to make this work, or as a method on the built-in str type (str.itergraphemes() or str.graphemes()).
>
> Below is my own implementation of this as a generator, as an example and for reference.
>
> ---
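> import unicodedata
>
> def itergraphemes(string):
>     """Yield each base character together with its trailing combining marks."""
>     def ismodifier(char):
>         # General categories Mn, Mc and Me are the combining marks.
>         return unicodedata.category(char).startswith('M')
>     start = 0
>     for end, char in enumerate(string):
>         # Any non-mark character begins a new cluster.
>         if not ismodifier(char) and start != end:
>             yield string[start:end]
>             start = end
>     if string:  # do not yield '' for an empty input string
>         yield string[start:]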
> ---
>
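The definition of a grapheme cluster is actually a little more
complicated than that: besides combining marks, the rules also cover
cases such as CR LF and conjoining Hangul jamo, which the category
test above would split. See here: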

http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
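
To make the difference concrete, here is a small sketch that restates the
category-based generator from the quoted message (so that it runs on its own)
and feeds it three inputs. The variable names are only illustrative. The
first input is the case the generator was written for; the other two are
sequences that UAX #29 keeps together but that contain no combining marks,
so a test on the "M" categories alone cannot group them.

```python
import unicodedata

def itergraphemes(string):
    # Restatement of the category-based approach from the quoted message:
    # a cluster is a character plus any following marks (Mn, Mc, Me).
    start = 0
    for end, char in enumerate(string):
        if not unicodedata.category(char).startswith('M') and start != end:
            yield string[start:end]
            start = end
    if string:
        yield string[start:]

# Base letter followed by COMBINING ACUTE ACCENT: kept as one cluster.
accented = list(itergraphemes('e\u0301'))

# CR LF: UAX #29 does not allow a break between them, but both are
# category Cc, so the category test splits them.
crlf = list(itergraphemes('\r\n'))

# Conjoining Hangul jamo (leading consonant + vowel): a single syllable
# under UAX #29, but both are category Lo, so they are split as well.
hangul = list(itergraphemes('\u1100\u1161'))

print(len(accented), len(crlf), len(hangul))
```

Handling these cases needs the Grapheme_Cluster_Break property values from
the report linked above, not just the general category.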