unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)
Hi,

Python provides a way to iterate characters of a string by using the string as an iterable. But there's no way to iterate over Unicode graphemes (a cluster of characters consisting of a base character plus a number of combining marks and other modifiers -- or what the human eye would consider to be one "character").

I think this ought to be provided either in the unicodedata library, (unicodedata.itergraphemes(string)) which exposes the character database information needed to make this work, or as a method on the built-in str type. (str.itergraphemes() or str.graphemes())

Below is my own implementation of this as a generator, as an example and for reference.

---
import unicodedata

def itergraphemes(string):
    def ismodifier(char):
        return unicodedata.category(char)[0] == 'M'
    start = 0
    for end, char in enumerate(string):
        if not ismodifier(char) and not start == end:
            yield string[start:end]
            start = end
    yield string[start:]
---

Thanks,
dpk
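To make the behaviour of the generator above concrete, here is a small usage sketch. It restates the same logic and applies it to a decomposed string; the sample word is chosen only for illustration, and the simplified rule (a character whose general category starts with 'M' attaches to the preceding character) is the one from the message, not the full Unicode definition.

```python
import unicodedata

def itergraphemes(string):
    """Yield base characters together with their trailing combining marks."""
    start = 0
    for end, char in enumerate(string):
        # A non-mark character starts a new cluster (except at index 0).
        if unicodedata.category(char)[0] != 'M' and start != end:
            yield string[start:end]
            start = end
    yield string[start:]

# "cafe" with an accent, in decomposed form: 'e' followed by U+0301.
word = unicodedata.normalize('NFD', 'caf\u00e9')
print(len(word))                  # number of code points
print(list(itergraphemes(word)))  # one list entry per cluster
```

Iterating the string directly would visit five code points, while the generator yields four clusters, the last being 'e' plus the combining accent. Note that, as written, an empty input yields a single empty string.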
On 07/07/2013 11:29, David Kendal wrote:
> Hi,
>
> Python provides a way to iterate characters of a string by using the string as an iterable. But there's no way to iterate over Unicode graphemes (a cluster of characters consisting of a base character plus a number of combining marks and other modifiers -- or what the human eye would consider to be one "character").
>
> I think this ought to be provided either in the unicodedata library, (unicodedata.itergraphemes(string)) which exposes the character database information needed to make this work, or as a method on the built-in str type. (str.itergraphemes() or str.graphemes())
>
> Below is my own implementation of this as a generator, as an example and for reference.
>
> ---
> import unicodedata
>
> def itergraphemes(string):
>     def ismodifier(char):
>         return unicodedata.category(char)[0] == 'M'
>     start = 0
>     for end, char in enumerate(string):
>         if not ismodifier(char) and not start == end:
>             yield string[start:end]
>             start = end
>     yield string[start:]
> ---

The definition of a grapheme cluster is actually a little more complicated than that. See here: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
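Two cases where the mark-only rule and the boundary rules in UAX #29 disagree can be shown with a short sketch. The helper name naive_clusters is made up for this example; it applies only the "category starts with M" test discussed in the thread.

```python
import unicodedata

def naive_clusters(string):
    """Split using only the 'category starts with M' rule from the thread."""
    clusters = []
    for char in string:
        if clusters and unicodedata.category(char)[0] == 'M':
            clusters[-1] += char      # attach the mark to the previous piece
        else:
            clusters.append(char)     # anything else starts a new piece
    return clusters

# UAX #29 keeps CR LF together, but neither character is a mark (both are Cc).
print(naive_clusters('\r\n'))
# Conjoining Hangul jamo (leading consonant + vowel) form one syllable,
# yet both code points have category Lo.
print(naive_clusters('\u1100\u1161'))
```

Both inputs come back as two pieces, whereas the grapheme cluster rules treat each of them as a single cluster. A complete implementation would need the Grapheme_Cluster_Break property, which unicodedata did not expose at the time of this thread.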
On 07.07.2013 12:29, David Kendal wrote:
> Hi,
>
> Python provides a way to iterate characters of a string by using the string as an iterable. But there's no way to iterate over Unicode graphemes (a cluster of characters consisting of a base character plus a number of combining marks and other modifiers -- or what the human eye would consider to be one "character").
>
> I think this ought to be provided either in the unicodedata library, (unicodedata.itergraphemes(string)) which exposes the character database information needed to make this work, or as a method on the built-in str type. (str.itergraphemes() or str.graphemes())
>
> Below is my own implementation of this as a generator, as an example and for reference.
>
> ---
> import unicodedata
>
> def itergraphemes(string):
>     def ismodifier(char):
>         return unicodedata.category(char)[0] == 'M'
>     start = 0
>     for end, char in enumerate(string):
>         if not ismodifier(char) and not start == end:
>             yield string[start:end]
>             start = end
>     yield string[start:]
> ---

Sounds like a good idea. Could you open a ticket for this to hash out the details?

Thanks,
--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Jul 08 2013)
Python Projects, Consulting and Support ... http://www.egenix.com/ mxODBC.Zope/Plone.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2013-07-16: Python Meeting Duesseldorf ... 8 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
On 8 Jul 2013, at 19:26, David Kendal <me@dpk.io> wrote:
>> Could you open a ticket for this to hash out the details ?
> Done!

Ooops. Should have included a link, sorry. <http://bugs.python.org/issue18406>

dpk
On 08.07.2013 20:27, David Kendal wrote:
> On 8 Jul 2013, at 19:26, David Kendal <me@dpk.io> wrote:
>>> Could you open a ticket for this to hash out the details ?
>> Done!
> Ooops. Should have included a link, sorry. <http://bugs.python.org/issue18406>

Thanks. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jul 09 2013)
On Sun, Jul 7, 2013 at 3:29 AM, David Kendal <me@dpk.io> wrote:
> Python provides a way to iterate characters of a string by using the string as an iterable. But there's no way to iterate over Unicode graphemes (a cluster of characters consisting of a base character plus a number of combining marks and other modifiers -- or what the human eye would consider to be one "character").
>
> I think this ought to be provided either in the unicodedata library, (unicodedata.itergraphemes(string)) which exposes the character database information needed to make this work, or as a method on the built-in str type. (str.itergraphemes() or str.graphemes())

A common case is wanting to extract the current grapheme or move forward or backward one. Please consider these other use cases rather than just adding an iterator.

g = unicodedata.grapheme_cluster(str, i)  # extracts cluster that includes index i (i may be in the middle of the cluster)
i = unicodedata.grapheme_start(str, i)  # if i is the start of the cluster, returns i; otherwise backs up to the start of the cluster
i = unicodedata.previous_cluster(str, i)  # moves i to the first index of the previous cluster; returns None if no previous cluster in the string
i = unicodedata.next_cluster(str, i)  # moves i to the first index of the next cluster; returns None if no next cluster in the string

I think these belong in unicodedata, not str.

--- Bruce
I'm hiring: http://www.geekwork.com/opportunity/1225-job-software-developer-cadencemd
Latest blog post: Alice's Puzzle Page http://www.vroospeak.com
Learn how hackers think: http://j.mp/gruyere-security
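A minimal sketch of how the four proposed functions might behave, written as plain module-level functions because none of them exist in unicodedata. It assumes the simplified boundary rule from the start of the thread (combining marks attach to the preceding character); the helper _is_mark is hypothetical.

```python
import unicodedata

def _is_mark(char):
    # Simplified stand-in for the real UAX #29 boundary rules.
    return unicodedata.category(char)[0] == 'M'

def grapheme_start(s, i):
    """Return the index where the cluster containing s[i] begins."""
    while i > 0 and _is_mark(s[i]):
        i -= 1
    return i

def next_cluster(s, i):
    """Return the start index of the cluster after the one at i, or None."""
    i += 1
    while i < len(s) and _is_mark(s[i]):
        i += 1
    return i if i < len(s) else None

def previous_cluster(s, i):
    """Return the start index of the cluster before the one at i, or None."""
    start = grapheme_start(s, i)
    return grapheme_start(s, start - 1) if start > 0 else None

def grapheme_cluster(s, i):
    """Return the cluster that contains index i."""
    start = grapheme_start(s, i)
    end = next_cluster(s, start)
    return s[start:end]  # end is None for the last cluster: slice to the end

text = 'e\u0301a'  # 'e' + combining acute accent, then 'a'
print(grapheme_start(text, 1))
print(repr(grapheme_cluster(text, 1)))
print(next_cluster(text, 0))
print(previous_cluster(text, 2))
```

Here index 1 points into the middle of the first cluster, so grapheme_start backs up to 0 and grapheme_cluster returns both code points. Returning None at either end of the string follows the comments in the proposal.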
I think the API Bruce suggests, along with its module location in 'unicodedata' makes more sense than the iterator only.

But it seems to me that it would still be useful to explicitly break a string into its component clusters with a similar function. E.g.:

graphemes = unicodedata.grapheme_clusters(str)  # Returns an iterator of strings, often single characters
for g in graphemes:
    ...

It wouldn't be very hard to implement 'grapheme_clusters' in terms of the API Bruce suggests, but I feel like it should have a standard name and API along with those others. Actually, I guess the implementation is just:

def grapheme_clusters(s):
    for i in range(len(s)):
        if i == unicodedata.grapheme_start(s, i):
            yield unicodedata.grapheme_cluster(s, i)

On Mon, Jul 8, 2013 at 11:52 AM, Bruce Leban <bruce@leapyear.org> wrote:
> On Sun, Jul 7, 2013 at 3:29 AM, David Kendal <me@dpk.io> wrote:
>> Python provides a way to iterate characters of a string by using the string as an iterable. But there's no way to iterate over Unicode graphemes (a cluster of characters consisting of a base character plus a number of combining marks and other modifiers -- or what the human eye would consider to be one "character").
>>
>> I think this ought to be provided either in the unicodedata library, (unicodedata.itergraphemes(string)) which exposes the character database information needed to make this work, or as a method on the built-in str type. (str.itergraphemes() or str.graphemes())
>
> A common case is wanting to extract the current grapheme or move forward or backward one. Please consider these other use cases rather than just adding an iterator.
>
> g = unicodedata.grapheme_cluster(str, i)  # extracts cluster that includes index i (i may be in the middle of the cluster)
> i = unicodedata.grapheme_start(str, i)  # if i is the start of the cluster, returns i; otherwise backs up to the start of the cluster
> i = unicodedata.previous_cluster(str, i)  # moves i to the first index of the previous cluster; returns None if no previous cluster in the string
> i = unicodedata.next_cluster(str, i)  # moves i to the first index of the next cluster; returns None if no next cluster in the string
>
> I think these belong in unicodedata, not str.
On Mon, Jul 8, 2013 at 1:02 PM, David Mertz <mertz@gnosis.cx> wrote:
> I think the API Bruce suggests, along with its module location in 'unicodedata' makes more sense than the iterator only.
>
> But it seems to me that it would still be useful to explicitly break a string into its component clusters with a similar function. E.g.:
>
> graphemes = unicodedata.grapheme_clusters(str)  # Returns an iterator of strings, often single characters
> for g in graphemes:
>     ...
>
> It wouldn't be very hard to implement 'grapheme_clusters' in terms of the API Bruce suggests, but I feel like it should have a standard name and API along with those others. Actually, I guess the implementation is just:
>
> def grapheme_clusters(s):
>     for i in range(len(s)):
>         if i == unicodedata.grapheme_start(s, i):
>             yield unicodedata.grapheme_cluster(s, i)

Yes, I still think the iterator is useful. I'd use the following implementation instead as the above is going to find the start of each multi-char grapheme multiple times.

def grapheme_clusters(s):
    if len(s):
        i = 0
        while i is not None:
            yield unicodedata.grapheme_cluster(s, i)
            i = unicodedata.grapheme_next(s, i)

This does "if len(s)" at the top rather than just "if s" so it raises if passed a non-iterable like None rather than silently accepting it.

--- Bruce
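The iterator above can be made runnable with a stand-in for the proposed stepping function. In this sketch next_cluster is a hypothetical helper (named after the earlier proposal, not a real unicodedata function) and uses the simplified mark-based rule, so it only approximates real grapheme boundaries.

```python
import unicodedata

def next_cluster(s, i):
    """Hypothetical helper: start index of the next cluster, or None."""
    i += 1
    while i < len(s) and unicodedata.category(s[i])[0] == 'M':
        i += 1
    return i if i < len(s) else None

def grapheme_clusters(s):
    """Yield each cluster once, advancing a whole cluster per step."""
    if len(s):                  # len(None) raises TypeError
        i = 0
        while i is not None:
            j = next_cluster(s, i)
            yield s[i:j]        # j is None at the end: slice to the end
            i = j

print(list(grapheme_clusters('e\u0301a')))
```

Each cluster boundary is computed once, which is the point of the rewrite. Because this is a generator function, the TypeError for a None argument surfaces when iteration starts, not when grapheme_clusters(None) is called.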
On 08/07/2013 21:26, Bruce Leban wrote:
> On Mon, Jul 8, 2013 at 1:02 PM, David Mertz <mertz@gnosis.cx> wrote:
>> I think the API Bruce suggests, along with its module location in 'unicodedata' makes more sense than the iterator only.
>>
>> But it seems to me that it would still be useful to explicitly break a string into its component clusters with a similar function. E.g.:
>>
>> graphemes = unicodedata.grapheme_clusters(str)  # Returns an iterator of strings, often single characters
>> for g in graphemes:
>>     ...
>>
>> It wouldn't be very hard to implement 'grapheme_clusters' in terms of the API Bruce suggests, but I feel like it should have a standard name and API along with those others. Actually, I guess the implementation is just:
>>
>> def grapheme_clusters(s):
>>     for i in range(len(s)):
>>         if i == unicodedata.grapheme_start(s, i):
>>             yield unicodedata.grapheme_cluster(s, i)
>
> Yes, I still think the iterator is useful. I'd use the following implementation instead as the above is going to find the start of each multi-char grapheme multiple times.
>
> def grapheme_clusters(s):
>     if len(s):
>         i = 0
>         while i is not None:
>             yield unicodedata.grapheme_cluster(s, i)
>             i = unicodedata.grapheme_next(s, i)
>
> This does "if len(s)" at the top rather than just "if s" so it raises if passed a non-iterable like None rather than silently accepting it.

If it's any help, the alternative regex implementation at: http://pypi.python.org/pypi/regex supports matching graphemes, although that bit is written in C.
Bruce Leban writes:
> On Sun, Jul 7, 2013 at 3:29 AM, David Kendal <me@dpk.io> wrote:
>> But there's no way to iterate over Unicode graphemes
>
> A common case is wanting to extract the current grapheme or move forward or backward one. Please consider these other use cases rather than just adding an iterator.
>
> g = unicodedata.grapheme_cluster(str, i)  # extracts cluster that includes index i (i may be in the middle of the cluster)

Why is indexing a string and returning a grapheme a common case? I would think the common case would be indexing or iterating over a grapheme sequence. At least, if we provided such a type, it would be.[1]

Also, for 20 years I've worked with Emacs/Mule which has a multibyte internal representation of characters, and so does a lot of byte index <-> character index conversion in the internals. I would like to avoid imposing that confusion on application programmers, unless they really need it for some reason.

Footnotes:
[1] Well, of course a lot of applications would continue to work with strs, just as today some applications work directly with bytes even though the content is readable text that could sensibly be translated to str. What I mean is that I expect that indexing str to get grapheme would be rare in applications if grapheme iterators and arrays were available.
On 2013-07-09, at 07:30, Stephen J. Turnbull wrote:
> Bruce Leban writes:
>> On Sun, Jul 7, 2013 at 3:29 AM, David Kendal <me@dpk.io> wrote:
>>> But there's no way to iterate over Unicode graphemes
>>
>> A common case is wanting to extract the current grapheme or move forward or backward one. Please consider these other use cases rather than just adding an iterator.
>>
>> g = unicodedata.grapheme_cluster(str, i)  # extracts cluster that includes index i (i may be in the middle of the cluster)
>
> Why is indexing a string and returning a grapheme a common case?

I don't know about that but I do know NSString provides two messages for that (one takes an index in a string and returns the corresponding grapheme boundaries — rangeOfComposedCharacterSequenceAtIndex:; and the other takes a range and returns the range of all composing graphemes — rangeOfComposedCharacterSequencesForRange:). Of course that might just be because it does not provide a higher-level iterator on graphemes.
On Mon, Jul 8, 2013 at 10:30 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
> Why is indexing a string and returning a grapheme a common case? I would think the common case would be indexing or iterating over a grapheme sequence. At least, if we provided such a type, it would be.[1]

If you want to do any operation on the clusters other than in iteration order, without indexed access you're going to end up doing list(grapheme_clusters(...)) first to give you indexed access. Maybe that's the right thing to do sometimes but I wouldn't force it on people. The string already provides indexed access but I need to know cluster boundaries.

Note that str.find returns an int, not the found string. What do I do with that index if I can't extract clusters in the middle?

Imagine you're writing code that works on English words. Would the only API you provide be one that iterates over the words? How would you write the function that finds the word after 'the' in a string?

--- Bruce
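The str.find case can be sketched as follows: take the integer index that find returns and widen it to the boundaries of the enclosing cluster. The function name cluster_span is invented for this example, and the boundary test is again the simplified mark-based rule, not the full UAX #29 algorithm.

```python
import unicodedata

def cluster_span(s, i):
    """Return (start, end) of the cluster containing index i (simplified rule)."""
    def is_mark(ch):
        return unicodedata.category(ch)[0] == 'M'
    start = i
    while start > 0 and is_mark(s[start]):
        start -= 1              # back up over marks to the base character
    end = i + 1
    while end < len(s) and is_mark(s[end]):
        end += 1                # extend over any following marks
    return start, end

# Decomposed text: U+00F1 becomes 'n' followed by U+0303 COMBINING TILDE.
text = unicodedata.normalize('NFD', 'el ni\u00f1o')
hit = text.find('\u0303')       # index of the combining tilde
start, end = cluster_span(text, hit)
print(hit, (start, end), repr(text[start:end]))
```

The match lands in the middle of a cluster, and the returned span covers the base letter plus its mark, which is the kind of operation an iterator-only API cannot provide without first materialising a list.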
On 7/9/2013 12:51 PM, Bruce Leban wrote:
> If you want to do any operation on the clusters other than in iteration order, without indexed access you're going to end up doing list(grapheme_clusters(...)) first to give you indexed access. Maybe that's the right thing to do sometimes but I wouldn't force it on people. The string already provides indexed access but I need to know cluster boundaries.

I think the best alternative to a list subclass of grapheme substrings (a subclass so that methods can be added) might be a GraphemeSeq wrapper class that contains a string (perhaps in a known normal form) and a list of indexes to grapheme start positions. That would also allow grapheme-oriented methods. If not already done, either or both of these would be good PyPI modules.

--
Terry Jan Reedy
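A rough sketch of the wrapper idea: one string plus a precomputed list of cluster start offsets, which makes indexing by cluster constant-time after a linear build. Only the class name GraphemeSeq comes from the message; the attribute names, the default normal form (NFD) and the mark-based boundary rule are assumptions made for this example.

```python
import unicodedata

class GraphemeSeq:
    """A string plus a precomputed list of cluster start offsets."""

    def __init__(self, string, form='NFD'):
        self.string = unicodedata.normalize(form, string)
        # Simplified boundary rule: a new cluster starts at every non-mark.
        self.starts = [i for i, ch in enumerate(self.string)
                       if i == 0 or unicodedata.category(ch)[0] != 'M']

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, n):
        if not isinstance(n, int):
            raise TypeError('only integer indexes in this sketch')
        if n < 0:
            n += len(self.starts)
        if not 0 <= n < len(self.starts):
            raise IndexError('grapheme index out of range')
        start = self.starts[n]
        # The cluster ends where the next one starts, or at the end of the string.
        end = self.starts[n + 1] if n + 1 < len(self.starts) else len(self.string)
        return self.string[start:end]

seq = GraphemeSeq('caf\u00e9s')
print(len(seq), repr(seq[3]), repr(seq[-1]))
```

The start-offset list costs one integer per cluster, and grapheme-oriented methods (slicing, searching, reversing) could be layered on top of it.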
On 9 Jul 2013, at 17:51, Bruce Leban <bruce@leapyear.org> wrote:
> If you want to do any operation on the clusters other than in iteration order, without indexed access you're going to end up doing list(grapheme_clusters(...)) first to give you indexed access. Maybe that's the right thing to do sometimes but I wouldn't force it on people. The string already provides indexed access but I need to know cluster boundaries.

There's no reason the iterator returned can't be of a new type that allows indexing with the subscript operator.

> --- Bruce

dpk
On 10 July 2013 00:15, David Kendal <me@dpk.io> wrote:
> On 9 Jul 2013, at 17:51, Bruce Leban <bruce@leapyear.org> wrote:
>> If you want to do any operation on the clusters other than in iteration order, without indexed access you're going to end up doing list(grapheme_clusters(...)) first to give you indexed access. Maybe that's the right thing to do sometimes but I wouldn't force it on people. The string already provides indexed access but I need to know cluster boundaries.
>
> There's no reason the iterator returned can't be of a new type that allows indexing with the subscript operator.

I've only loosely followed this thread but that sounds like a really weird idea to me. The standard is to have an object with the properties you want that can be coerced to an iterator through its __iter__ method. Maybe that's what you meant, though.
>>> range(133)[32]
32
>>> iter(range(133))[32]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'range_iterator' object is not subscriptable
On 10 Jul 2013, at 00:40, Joshua Landau <joshua@landau.ws> wrote:
> I've only loosely followed this thread but that sounds like a really weird idea to me. The standard is to have an object with the properties you want that can be coerced to an iterator through its __iter__ method. Maybe that's what you meant, though.

Well, right. I meant "a new type" like dict.keys() and dict.values() are "view types" on a dictionary that provide iterator interfaces. This would just be a "grapheme view" on a string.

dpk
2013/7/10 David Kendal <me@dpk.io>
> Well, right. I meant “a new type” like dict.keys() and dict.values() are “view types” on a dictionary that provide iterator interfaces. This would just be a “grapheme view” on a string.

i think that’s the way to go. who would want dozens of new functions in unicodedata?

how about something like the following? it can easily be extended to get a reverse iterator.

setting its pos and calling find_grapheme or __next__ or previous allows for bruce’s usecases.

class GraphemeIterator:
    def __init__(self, string, start=0):
        self.string = string
        self.pos = start

    def __iter__(self):
        return self

    def __next__(self):
        _, next_pos, grapheme = self.find_grapheme()
        self.pos = next_pos
        return grapheme

    def previous(self):
        prev_pos, _, grapheme = self.find_grapheme(backwards=True)
        self.pos = prev_pos
        return grapheme

    def find_grapheme(self, i=None, *, backwards=False):
        """finds next complete grapheme in string, starting at position i

        if backwards is not set, finds grapheme starting at i, or the next one if i is in the middle of one
        if it is set, it finds the grapheme which i points to, even if that’s the middle.
        if str[i] is the beginning of a grapheme, backwards finds the one before it.
        """
        if i is None:
            i = self.pos
        ...
        return (start, end, grapheme)

def find_grapheme(string, i, backwards=False):
    """ convenience function for oneshotting it """
    return GraphemeIterator(string, i).find_grapheme(backwards=backwards)
On 10 July 2013 13:04, Philipp A. <flying-sheep@web.de> wrote:
> 2013/7/10 David Kendal <me@dpk.io>
>> Well, right. I meant “a new type” like dict.keys() and dict.values() are “view types” on a dictionary that provide iterator interfaces. This would just be a “grapheme view” on a string.
>
> i think that’s the way to go. who would want dozens of new functions in unicodedata?

You've missed both of our points. Consider:
>>> {}.keys()
dict_keys([])
>>> iter({}.keys())
<dict_keyiterator object at 0x7fe3d633a890>
There are good reasons why a "view" should not be its iterator.
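One such reason can be shown with a small sketch: a view can be iterated any number of times, while an iterator is consumed by its first pass. GraphemeView is a hypothetical name, and the cluster splitting uses the simplified mark-based rule from earlier in the thread.

```python
import unicodedata

class GraphemeView:
    """Re-iterable view of a string's clusters (hypothetical name)."""

    def __init__(self, string):
        self.string = string

    def __iter__(self):
        # Each call builds a fresh generator, so the view itself is never
        # used up; the generator is the one-shot iterator.
        s, start = self.string, 0
        for end in range(1, len(s)):
            if unicodedata.category(s[end])[0] != 'M':
                yield s[start:end]
                start = end
        if s:
            yield s[start:]

view = GraphemeView('e\u0301a')
print(list(view), list(view))   # the view can be walked twice
it = iter(view)
print(list(it), list(it))       # the iterator is empty on the second pass
```

This mirrors dict_keys versus dict_keyiterator: the view holds the data and hands out independent iterators, so two loops over the same view do not interfere with each other.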
> how about something like the following? it can easily be extended to get a reverse iterator.
>
> setting its pos and calling find_grapheme or __next__ or previous allows for bruce’s usecases.
>
> class GraphemeIterator:
>     def __init__(self, string, start=0):
>         self.string = string
>         self.pos = start
>
>     def __iter__(self):
>         return self
>
>     def __next__(self):
>         _, next_pos, grapheme = self.find_grapheme()
>         self.pos = next_pos
>         return grapheme
>
>     def previous(self):
>         prev_pos, _, grapheme = self.find_grapheme(backwards=True)
>         self.pos = prev_pos
>         return grapheme
>
>     def find_grapheme(self, i=None, *, backwards=False):
>         """finds next complete grapheme in string, starting at position i
>
>         if backwards is not set, finds grapheme starting at i, or the next one if i is in the middle of one
>         if it is set, it finds the grapheme which i points to, even if that’s the middle.
>         if str[i] is the beginning of a grapheme, backwards finds the one before it.
>         """
>         if i is None:
>             i = self.pos
>         ...
>         return (start, end, grapheme)
>
> def find_grapheme(string, i, backwards=False):
>     """ convenience function for oneshotting it """
>     return GraphemeIterator(string, i).find_grapheme(backwards=backwards)

2013/7/10 Joshua Landau joshua@landau.ws
> >>> {}.keys()
> dict_keys([])
> >>> iter({}.keys())
> <dict_keyiterator object at 0x7fe3d633a890>
>
> There are good reasons why a “view” should not be its iterator.

you’re right, but one would expect the view’s __getitem__(i) method to return the ith grapheme, which implies constant-time access. and we can only support linear-time access to that (i.e. by iterating stuff) if we don’t want to build a complex index.

so should we do a view object that only allows something like my find_grapheme and iteration?
On 10 July 2013 18:39, Philipp A. <flying-sheep@web.de> wrote:
> 2013/7/10 Joshua Landau joshua@landau.ws
>> >>> {}.keys()
>> dict_keys([])
>> >>> iter({}.keys())
>> <dict_keyiterator object at 0x7fe3d633a890>
>>
>> There are good reasons why a “view” should not be its iterator.
>
> you’re right, but one would expect the view’s __getitem__(i) method to return the ith grapheme, which implies constant-time access. and we can only support linear-time access to that (i.e. by iterating stuff) if we don’t want to build a complex index.
>
> so should we do a view object that only allows something like my find_grapheme and iteration?

I haven't followed much of this because it's not very relevant to me now. I just thought it extremely odd to have an interface inconsistent with Python's standard.

However, if what you want is something that works akin to an IOWrapper, then I'm wrong and an iterator that has lots of methods is actually already standard. Hence I've changed my mind.

That may not have been what you expected me to say.
On 08.07.2013 20:52, Bruce Leban wrote:
> On Sun, Jul 7, 2013 at 3:29 AM, David Kendal <me@dpk.io> wrote:
>> Python provides a way to iterate characters of a string by using the string as an iterable. But there's no way to iterate over Unicode graphemes (a cluster of characters consisting of a base character plus a number of combining marks and other modifiers -- or what the human eye would consider to be one "character").
>>
>> I think this ought to be provided either in the unicodedata library, (unicodedata.itergraphemes(string)) which exposes the character database information needed to make this work, or as a method on the built-in str type. (str.itergraphemes() or str.graphemes())
>
> A common case is wanting to extract the current grapheme or move forward or backward one. Please consider these other use cases rather than just adding an iterator.
>
> g = unicodedata.grapheme_cluster(str, i)  # extracts cluster that includes index i (i may be in the middle of the cluster)
> i = unicodedata.grapheme_start(str, i)  # if i is the start of the cluster, returns i; otherwise backs up to the start of the cluster
> i = unicodedata.previous_cluster(str, i)  # moves i to the first index of the previous cluster; returns None if no previous cluster in the string
> i = unicodedata.next_cluster(str, i)  # moves i to the first index of the next cluster; returns None if no next cluster in the string
>
> I think these belong in unicodedata, not str.

FWIW: Here's a pre-PEP I once wrote for these things: http://mail.python.org/pipermail/python-dev/2001-July/015938.html

At the time there was little interest, so I dropped the idea.

--
Marc-Andre Lemburg
eGenix.com
Professional Python Services directly from the Source (#1, Jul 09 2013)
participants (10)
- Bruce Leban
- David Kendal
- David Mertz
- Joshua Landau
- M.-A. Lemburg
- Masklinn
- MRAB
- Philipp A.
- Stephen J. Turnbull
- Terry Reedy