[Python-Dev] PEP 393 Summer of Code Project

Neil Hodgson nyamatongwe at gmail.com
Thu Sep 1 02:58:57 CEST 2011


Glenn Linderman:

> That said, regexp, or some sort of cursor on a string, might be a workable
> solution.  Will it have adequate performance?  Perhaps, at least for some
> applications.  Will it be as conceptually simple as indexing an array of
> graphemes?  No.  Will it ever reach the efficiency of indexing an array of
> graphemes? No.  Does that matter? Depends on the application.

   Using an iterator for cluster access is currently a common
technique. For example, with the Pango text layout and drawing
library, you may create a PangoLayoutIter over a text layout object
(which contains a UTF-8 string along with formatting information) and
iterate by clusters by calling pango_layout_iter_next_cluster. Direct
access to clusters by index is not as useful in this domain as access
by pixel positions - for example to examine the portion of a layout
visible in a window.

   http://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-get-iter
   In this API, 'index' is used to refer to a byte index into UTF-8,
not a character or cluster index.
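
   To make the byte-index versus character-index distinction concrete,
here is a minimal Python sketch. The helper names are hypothetical and
are not part of Pango; they only illustrate how the two kinds of index
diverge once a string contains non-ASCII characters.

```python
def char_index_to_byte_index(text, char_index):
    """Return the UTF-8 byte offset at which the given code point starts."""
    return len(text[:char_index].encode("utf-8"))


def byte_index_to_char_index(data, byte_index):
    """Return the code point index for a UTF-8 byte offset.

    Assumes byte_index falls on a code point boundary; otherwise
    decoding the prefix raises UnicodeDecodeError.
    """
    return len(data[:byte_index].decode("utf-8"))


text = "na\u00efve"              # "naïve"; U+00EF takes two bytes in UTF-8
data = text.encode("utf-8")

v_char_index = text.index("v")
v_byte_index = char_index_to_byte_index(text, v_char_index)
```

   Here "v" is the fourth code point (index 3) but starts at byte
offset 4, so an API that reports byte indexes cannot be used directly
to subscript a Python str.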

   Rather than discuss functionality in the abstract, we need some use
cases involving different levels of character and cluster access to
see whether providing indexed access is worthwhile. I'll start with an
example: some text drawing engines draw decomposed characters ("o"
followed by " ̈" -> "ö") differently compared to their composite
equivalents ("ö") and this may be perceived as better or worse. I'd
like to offer an option to replace some decomposed characters with
their composite equivalent before drawing but since other characters
may look worse, I don't want to do a full normalization. The API style
that appears most useful for this example is an iterator over the
input string that yields composed and decomposed character strings
(that is, it will yield both the decomposed "o" + U+0308 and the
precomposed "ö"). Each character string is then replaced if it appears
in a substitution dictionary, and written to an output string. This is
similar to an iterator over grapheme clusters
although, since it is only aimed at composing sequences, the iterator
could be simpler than a full grapheme cluster iterator.
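
   A rough sketch of that iterator style in Python follows. The names
iter_combining_sequences, SUBSTITUTIONS and compose_selected are made up
for illustration, and the grouping rule (a character plus any following
characters with a non-zero canonical combining class) is an assumption
that is deliberately simpler than full grapheme cluster segmentation.

```python
import unicodedata


def iter_combining_sequences(text):
    """Yield each character together with the combining marks that follow it.

    Simplification: a character is treated as a combining mark only if
    its canonical combining class is non-zero.
    """
    current = ""
    for ch in text:
        if current and unicodedata.combining(ch) != 0:
            current += ch          # extend the sequence in progress
        else:
            if current:
                yield current      # emit the finished sequence
            current = ch           # start a new sequence
    if current:
        yield current


# Hypothetical substitution dictionary: only the sequences listed here
# are composed; everything else passes through unchanged.
SUBSTITUTIONS = {"o\u0308": "\u00f6"}


def compose_selected(text, substitutions=SUBSTITUTIONS):
    """Replace listed sequences and copy all others to the output."""
    return "".join(substitutions.get(seq, seq)
                   for seq in iter_combining_sequences(text))
```

   With this dictionary, compose_selected("co\u0308operate e\u0301")
composes the "o" + U+0308 pair but leaves "e" + U+0301 decomposed, which
is the selective behaviour that a full normalization would not give.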

   One of the benefits of iterator access to text is that many
different iterators can be built without burdening the implementation
object with extra memory costs as would be likely with techniques that
build indexes into the representation.

   Neil

