[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)

Mon Jul 8 22:41:20 CEST 2013

On 08/07/2013 21:26, Bruce Leban wrote:
>
> On Mon, Jul 8, 2013 at 1:02 PM, David Mertz <mertz at gnosis.cx> wrote:
>
>     I think the API Bruce suggests, along with its module location in
>     'unicodedata', makes more sense than the iterator alone.
>
>     But it seems to me that it would still be useful to explicitly break
>     a string into its component clusters with a similar function.  E.g.:
>
>        # Returns an iterator of strings, often single characters
>        graphemes = unicodedata.grapheme_clusters(s)
>        for g in graphemes: ...
>
>     It wouldn't be very hard to implement 'grapheme_clusters' in terms
>     of the API Bruce suggests, but I feel like it should have a standard
>     name and API along with those others.  Actually, I guess the
>     implementation is just:
>
>        def grapheme_clusters(s):
>            for i in range(len(s)):
>                # Yield only at indexes that begin a cluster.
>                if i == unicodedata.grapheme_start(s, i):
>                    yield unicodedata.grapheme_cluster(s, i)
>
>
> Yes, I still think the iterator is useful. I'd use the following
> implementation instead as the above is going to find the start of each
> multi-char grapheme multiple times.
>
>     def grapheme_clusters(s):
>         if len(s):
>             i = 0
>             # grapheme_next is taken to return None after the last cluster.
>             while i is not None:
>                 yield unicodedata.grapheme_cluster(s, i)
>                 i = unicodedata.grapheme_next(s, i)
>
>
> This does "if len(s)" at the top rather than just "if s" so that it
> raises TypeError when passed an object with no length, such as None,
> rather than silently yielding nothing.
>
If it's any help, the alternative regex implementation at:

http://pypi.python.org/pypi/regex

supports matching graphemes, although that bit is written in C.
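
For illustration, here is a minimal sketch of the iterator discussed above,
built on that module's \X pattern, which matches one grapheme cluster. It
assumes the third-party regex package is installed. The ImportError fallback
is my own simplification (a base character plus any combining marks that
follow it, via unicodedata.combining); it does not implement the full Unicode
segmentation rules, so it may disagree with \X on cases such as CRLF, Hangul
jamo and emoji sequences.

```python
import unicodedata

try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None


def grapheme_clusters(s):
    """Yield the grapheme clusters of s, one string per cluster."""
    if regex is not None:
        # In the regex module, \X matches a single grapheme cluster.
        yield from regex.findall(r"\X", s)
        return

    # Fallback approximation (an assumption, not the real rules):
    # start a new cluster at every character that is not a combining mark.
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster


if __name__ == "__main__":
    # "e" followed by U+0301 COMBINING ACUTE ACCENT forms one cluster.
    for g in grapheme_clusters("e\u0301a"):
        print(ascii(g))
```

Either way the caller gets an iterator of strings, which is the shape David
asked for; only the segmentation quality differs between the two paths.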