<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#330033">
On 8/31/2011 5:58 PM, Neil Hodgson wrote:
<blockquote
cite="mid:CAMLCkUeqSVt7LirPEvj_=nZp7nwb9uS8z4ba7LK2dFHdmXrQhw@mail.gmail.com"
type="cite">
<pre wrap="">Glenn Linderman:
</pre>
<blockquote type="cite">
<pre wrap="">That said, regexp, or some sort of cursor on a string, might be a workable
solution. Will it have adequate performance? Perhaps, at least for some
applications. Will it be as conceptually simple as indexing an array of
graphemes? No. Will it ever reach the efficiency of indexing an array of
graphemes? No. Does that matter? Depends on the application.
</pre>
</blockquote>
<pre wrap="">
Using an iterator for cluster access is a common technique
currently. For example, with the Pango text layout and drawing
library, you may create a PangoLayoutIter over a text layout object
(which contains a UTF-8 string along with formatting information) and
iterate by clusters by calling pango_layout_iter_next_cluster. Direct
access to clusters by index is not as useful in this domain as access
by pixel positions - for example to examine the portion of a layout
visible in a window.
<a class="moz-txt-link-freetext" href="http://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-get-iter">http://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-get-iter</a>
In this API, 'index' is used to refer to a byte index into UTF-8,
not a character or cluster index.</pre>
</blockquote>
<br>
I agree that applications differ in the kinds of index they need
into a large string, and in the starting points those indexes
cover. Where an application requires its own custom index, a
standard index may not be needed.<br>
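<br>
To make the different kinds of index concrete, here is a minimal
Python sketch (not taken from Pango or from any proposed API) that
records, for each cluster start, both a code point index and a UTF-8
byte offset. The name cluster_start_indexes is hypothetical, and a
"cluster" is approximated as a base character plus any following
combining marks, using unicodedata.combining; full grapheme
segmentation as defined in UAX #29 has more rules than this.<br>
<pre wrap="">
```python
import unicodedata


def cluster_start_indexes(text):
    """Return (code point index, UTF-8 byte offset) for each cluster start.

    Simplifying assumption: a cluster is a base character followed by
    zero or more combining marks. This is not full UAX #29 segmentation.
    """
    table = []
    byte_offset = 0
    for cp_index, ch in enumerate(text):
        # A character with combining class 0 starts a new cluster.
        if cp_index == 0 or unicodedata.combining(ch) == 0:
            table.append((cp_index, byte_offset))
        byte_offset += len(ch.encode("utf-8"))
    return table


# e + combining acute, a + combining diaeresis, plain o
sample = "e\u0301a\u0308o"
starts = cluster_start_indexes(sample)
```
</pre>
For this sample, starts is [(0, 0), (2, 3), (4, 6)]: the third
cluster begins at code point 4 but at byte 6, so the same position
has three different index values depending on the unit counted.<br>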
<br>
<blockquote
cite="mid:CAMLCkUeqSVt7LirPEvj_=nZp7nwb9uS8z4ba7LK2dFHdmXrQhw@mail.gmail.com"
type="cite">
<pre wrap=""> One of the benefits of iterator access to text is that many
different iterators can be built without burdening the implementation
object with extra memory costs as would be likely with techniques that
build indexes into the representation.
</pre>
</blockquote>
<br>
How many different iterators into the same text would an
application need concurrently, and why? It seems that if it is
dealing with text at the level of grapheme clusters, it needs a
grapheme cluster iterator. Of course, if it does I/O it also needs
codec access, but that is by nature sequential, from the starting
point to the end point.<br>
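<br>
For reference, the iterator approach can be sketched in Python as a
generator. This is only an illustration under the same simplifying
assumption as before (base character plus following combining marks,
via unicodedata.combining), not full UAX #29 segmentation, and
iter_clusters is a hypothetical name. It shows two independent
iterators over one string, with no index stored on the string.<br>
<pre wrap="">
```python
import unicodedata


def iter_clusters(text):
    """Yield approximate grapheme clusters of text, one at a time."""
    cluster = ""
    for ch in text:
        # Combining class 0 means ch starts a new cluster, so emit
        # whatever has been collected so far.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


text = "e\u0301a\u0308o"
first = iter_clusters(text)
second = iter_clusters(text)

head = next(first)    # advances only the first iterator
rest = list(second)   # the second still starts at the beginning
```
</pre>
Each generator holds only its own position, so creating several of
them adds nothing to the string object itself. The trade-off is that
reaching the Nth cluster means walking the text from the start.<br>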
</body>
</html>