[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)
MRAB
python at mrabarnett.plus.com
Mon Jul 8 22:41:20 CEST 2013
On 08/07/2013 21:26, Bruce Leban wrote:
>
> On Mon, Jul 8, 2013 at 1:02 PM, David Mertz <mertz at gnosis.cx> wrote:
>
> I think the API Bruce suggests, along with its module location in
> 'unicodedata' makes more sense than the iterator only.
>
> But it seems to me that it would still be useful to explicitly break
> a string into its component clusters with a similar function. E.g.:
>
>     # Returns an iterator of strings, often single characters
>     graphemes = unicodedata.grapheme_clusters(s)
>     for g in graphemes: ...
>
> It wouldn't be very hard to implement 'grapheme_clusters' in terms
> of the API Bruce suggests, but I feel like it should have a standard
> name and API along with those others. Actually, I guess the
> implementation is just:
>
>     def grapheme_clusters(s):
>         for i in range(len(s)):
>             # Only yield when i is the first index of its cluster
>             if i == unicodedata.grapheme_start(s, i):
>                 yield unicodedata.grapheme_cluster(s, i)
>
>
> Yes, I still think the iterator is useful. I'd use the following
> implementation instead as the above is going to find the start of each
> multi-char grapheme multiple times.
>
>     def grapheme_clusters(s):
>         if len(s):
>             i = 0
>             # grapheme_next returns None after the last cluster
>             while i is not None:
>                 yield unicodedata.grapheme_cluster(s, i)
>                 i = unicodedata.grapheme_next(s, i)
>
>
> This does "if len(s)" at the top rather than just "if s" so that it
> raises TypeError when passed an unsized object such as None, instead of
> silently yielding nothing.
>
If it's any help, the alternative regex implementation at:
http://pypi.python.org/pypi/regex
supports matching graphemes, although that bit is written in C.
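
To make the iterator discussed above concrete, here is a minimal pure-Python
sketch. The helpers grapheme_next and grapheme_cluster are hypothetical
stand-ins for the proposed API (they do not exist in unicodedata), and the
boundary rule is deliberately simplified: a cluster is taken to be one code
point plus any combining marks that follow it. That is not the full Unicode
grapheme cluster definition. It also assumes the index passed to
grapheme_cluster is already the start of a cluster.

```python
import unicodedata


def grapheme_next(s, i):
    """Hypothetical helper: start index of the cluster after the one at i, or None."""
    j = i + 1
    # Simplified rule (an assumption, not the full Unicode algorithm):
    # absorb every following code point with a non-zero combining class.
    while j < len(s) and unicodedata.combining(s[j]):
        j += 1
    return j if j < len(s) else None


def grapheme_cluster(s, i):
    """Hypothetical helper: the cluster that starts at index i."""
    j = grapheme_next(s, i)
    return s[i:] if j is None else s[i:j]


def grapheme_clusters(s):
    """Yield each cluster of s in order, visiting every cluster start once."""
    if len(s):  # len(None) raises TypeError instead of silently yielding nothing
        i = 0
        while i is not None:
            yield grapheme_cluster(s, i)
            i = grapheme_next(s, i)
```

With this sketch, list(grapheme_clusters("e\u0301a")) gives the two clusters
"e\u0301" and "a". With the regex module installed, something like
regex.findall(r"\X", s) should give the proper segmentation without any of
these simplifications.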