[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)

Mon Jul 8 22:02:37 CEST 2013

I think the API Bruce suggests, along with its location in the
'unicodedata' module, makes more sense than an iterator alone.

But it seems to me that it would still be useful to explicitly break a
string into its component clusters with a similar function.  E.g.:

  # Returns an iterator of strings, often single characters
  graphemes = unicodedata.grapheme_clusters(str)
  for g in graphemes: ...

It wouldn't be very hard to implement 'grapheme_clusters' in terms of the
API Bruce suggests, but I feel like it should have a standard name and API
along with those others.  Actually, I guess the implementation is just:

  def grapheme_clusters(s):
      # Yield one cluster per index that is the start of a cluster.
      for i in range(len(s)):
          if i == unicodedata.grapheme_start(s, i):
              yield unicodedata.grapheme_cluster(s, i)
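
To check that this shape actually works, here is a rough, self-contained
sketch. The names grapheme_start and grapheme_cluster are the proposed ones
from this thread; they are not part of the real unicodedata module, so they
are written as plain functions here. The cluster rule is a deliberate
simplification, not the full Unicode segmentation algorithm: it assumes a
cluster is one character plus any following characters with a nonzero
canonical combining class (unicodedata.combining is a real function).

```python
import unicodedata


def _extends_cluster(ch):
    # Assumption: only characters with a nonzero combining class attach to
    # the preceding character. Real grapheme rules cover many more cases.
    return unicodedata.combining(ch) != 0


def grapheme_start(s, i):
    # Back up from index i to the first index of the cluster containing it.
    while i > 0 and _extends_cluster(s[i]):
        i -= 1
    return i


def grapheme_cluster(s, i):
    # Extract the whole cluster that includes index i.
    start = grapheme_start(s, i)
    end = start + 1
    while end < len(s) and _extends_cluster(s[end]):
        end += 1
    return s[start:end]


def grapheme_clusters(s):
    # Yield one cluster per index that is the start of a cluster.
    for i in range(len(s)):
        if i == grapheme_start(s, i):
            yield grapheme_cluster(s, i)
```

With this, list(grapheme_clusters("e\u0301a")) gives two strings: the "e"
together with its combining acute accent, then "a".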

On Mon, Jul 8, 2013 at 11:52 AM, Bruce Leban <bruce at leapyear.org> wrote:

>
> On Sun, Jul 7, 2013 at 3:29 AM, David Kendal <me at dpk.io> wrote:
>
>> Python provides a way to iterate characters of a string by using the
>> string as an iterable. But there's no way to iterate over Unicode graphemes
>> (a cluster of characters consisting of a base character plus a number of
>> combining marks and other modifiers -- or what the human eye would consider
>> to be one "character").
>>
>> I think this ought to be provided either in the unicodedata library,
>> (unicodedata.itergraphemes(string)) which exposes the character database
>> information needed to make this work, or as a method on the built-in str
>> type. (str.itergraphemes() or str.graphemes())
>
>
> A common case is wanting to extract the current grapheme or move forward
> or backward one. Please consider these other use cases rather than just
> adding an iterator.
>
>   # extracts the cluster that includes index i (i may be in the middle
>   # of the cluster)
>   g = unicodedata.grapheme_cluster(str, i)
>   # if i is the start of the cluster, returns i; otherwise backs up to
>   # the start of the cluster
>   i = unicodedata.grapheme_start(str, i)
>   # moves i to the first index of the previous cluster; returns None if
>   # no previous cluster in the string
>   i = unicodedata.previous_cluster(str, i)
>   # moves i to the first index of the next cluster; returns None if
>   # no next cluster in the string
>   i = unicodedata.next_cluster(str, i)
>
>
> I think these belong in unicodedata, not str.
>
> --- Bruce

-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.