Glenn Linderman:
That said, regexp, or some sort of cursor on a string, might be a workable solution. Will it have adequate performance? Perhaps, at least for some applications. Will it be as conceptually simple as indexing an array of graphemes? No. Will it ever reach the efficiency of indexing an array of graphemes? No. Does that matter? Depends on the application.
Using an iterator for cluster access is currently a common technique. For example, with the Pango text layout and drawing library, you can create a PangoLayoutIter over a text layout object (which contains a UTF-8 string along with formatting information) and iterate by clusters by calling pango_layout_iter_next_cluster. Direct access to clusters by index is not as useful in this domain as access by pixel position, for example to examine the portion of a layout visible in a window.

http://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layo...

In this API, 'index' refers to a byte index into the UTF-8 string, not a character or cluster index.

Rather than discuss functionality in the abstract, we need some use cases involving different levels of character and cluster access to see whether providing indexed access is worthwhile.

I'll start with an example: some text drawing engines draw decomposed characters ("o" followed by " ̈" -> "ö") differently from their composite equivalents ("ö"), and this may be perceived as better or worse. I'd like to offer an option to replace some decomposed characters with their composite equivalents before drawing, but since other characters may look worse when composed, I don't want to do a full normalization.

The API style that appears most useful for this example is an iterator over the input string that yields both composed and decomposed character strings (that is, it will yield "ö" in either its composite or its decomposed form, whichever the input contains). Each character string is then replaced if it appears in a substitution dictionary, and the result is written to an output string. This is similar to an iterator over grapheme clusters, although, since it is only aimed at composing sequences, the iterator could be simpler than a full grapheme cluster iterator.

One of the benefits of iterator access to text is that many different iterators can be built without burdening the implementation object with extra memory costs, as would be likely with techniques that build indexes into the representation.

Neil
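
P.S. A minimal sketch of the iterator described above, to make the use case concrete. It assumes that "a base character followed by any characters with a non-zero canonical combining class" is a good enough approximation of a composing sequence; it is not a full grapheme cluster iterator. The names iter_combining_sequences, compose_selected and SUBSTITUTIONS are made up for illustration, and only unicodedata.combining comes from the standard library.

```python
import unicodedata


def iter_combining_sequences(text):
    """Yield each character together with any combining marks that follow it.

    Assumption: a character is treated as a combining mark when its
    canonical combining class is non-zero. This is simpler than, and not
    equivalent to, full grapheme cluster segmentation.
    """
    current = ""
    for ch in text:
        if current and unicodedata.combining(ch):
            # Attach the combining mark to the sequence being built.
            current += ch
        else:
            # A new base character starts; emit the previous sequence.
            if current:
                yield current
            current = ch
    if current:
        yield current


def compose_selected(text, substitutions):
    """Replace only the sequences listed in `substitutions`; copy the rest."""
    return "".join(
        substitutions.get(seq, seq) for seq in iter_combining_sequences(text)
    )


# Hypothetical substitution table: decomposed "o" + U+0308 -> composite U+00F6.
SUBSTITUTIONS = {"o\u0308": "\u00f6"}

# "e" + U+0301 is not in the table, so it stays decomposed.
result = compose_selected("Zo\u0308e and e\u0301", SUBSTITUTIONS)
```

Because the iterator is a plain generator over the string, it needs no index stored alongside the string object, and a different iterator (by code point, by full grapheme cluster) could be swapped in without changing compose_selected.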