[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)

Mon Jul 8 22:26:50 CEST 2013

On Mon, Jul 8, 2013 at 1:02 PM, David Mertz <mertz at gnosis.cx> wrote:

> I think the API Bruce suggests, along with its module location in
> 'unicodedata' makes more sense than the iterator only.
>
> But it seems to me that it would still be useful to explicitly break a
> string into its component clusters with a similar function.  E.g.:
>
>   # Returns an iterator of strings, often single characters
>   graphemes = unicodedata.grapheme_clusters(str)
>   for g in graphemes: ...
>
> It wouldn't be very hard to implement 'grapheme_clusters' in terms of the
> API Bruce suggests, but I feel like it should have a standard name and API
> along with those others.  Actually, I guess the implementation is just:
>
>   def grapheme_clusters(s):
>       for i in range(len(s)):
>           if i == unicodedata.grapheme_start(s, i):
>               yield unicodedata.grapheme_cluster(s, i)
>

Yes, I still think the iterator is useful. I'd use the following
implementation instead, since the one above calls grapheme_start at every
index and so finds the start of each multi-char grapheme multiple times.

def grapheme_clusters(s):
    if len(s):
        i = 0
        while i is not None:
            yield unicodedata.grapheme_cluster(s, i)
            i = unicodedata.grapheme_next(s, i)

This does "if len(s)" at the top rather than just "if s" so that it raises
TypeError for an argument with no length, such as None, rather than
silently yielding nothing.
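
To try the iterator out today, here is a rough, self-contained sketch. The
functions grapheme_start, grapheme_next and grapheme_cluster are hypothetical
stand-ins for the proposed API (none of them exist in the standard
unicodedata module), and a "cluster" is simplified here to a base character
plus any following combining marks, which is an assumption and not the full
UAX #29 rule set.

```python
import unicodedata

# Hypothetical stand-ins for the proposed API; NOT part of unicodedata.
# Simplifying assumption: a cluster is one base character followed by
# zero or more combining marks (nonzero canonical combining class).


def _is_extend(ch):
    return unicodedata.combining(ch) != 0


def grapheme_start(s, i):
    """Return the index where the cluster containing s[i] begins."""
    while i > 0 and _is_extend(s[i]):
        i -= 1
    return i


def grapheme_next(s, i):
    """Return the index where the next cluster begins, or None at the end."""
    i += 1
    while i < len(s) and _is_extend(s[i]):
        i += 1
    return i if i < len(s) else None


def grapheme_cluster(s, i):
    """Return the whole cluster containing s[i]."""
    start = grapheme_start(s, i)
    end = grapheme_next(s, i)
    return s[start:] if end is None else s[start:end]


def grapheme_clusters(s):
    # len(s) raises TypeError for None; "if s" would silently yield nothing.
    if len(s):
        i = 0
        while i is not None:
            yield grapheme_cluster(s, i)
            i = grapheme_next(s, i)
```

With these stand-ins, list(grapheme_clusters("e\u0301a")) gives two clusters,
the "e" with its combining acute accent and then "a". Since this is a
generator, the TypeError for None is raised on the first iteration rather
than at the call itself.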

--- Bruce