Glenn Linderman writes:
- (Most space efficient) One cached entry, caching the last
codepoint/byte position referenced. UTF-8 can be traversed in either direction, so "next/previous" codepoint access would be relatively fast (and such accesses are very common operations, even when indexing notation is used: "for ix in range(len(str_x)): func(str_x[ix])".)
Been there, tried that (Emacsen). Either it's a YAGNI (moving forward or backward over UTF-8 by characters short distances is plenty fast, especially if you've got a lot of ASCII you can move by words for somewhat longer distances), or it's not good enough. There *may* be a sweet spot, but it's definitely smaller than the one on Sharapova's racket.
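For what it's worth, the bidirectional traversal Glenn describes is cheap because UTF-8 continuation bytes are self-identifying: they all match 0b10xxxxxx, so stepping in either direction just means skipping them. A minimal sketch (function names are mine, purely illustrative):

```python
def next_cp(buf: bytes, i: int) -> int:
    """Byte index of the codepoint following the one starting at i."""
    i += 1
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:  # skip continuation bytes
        i += 1
    return i

def prev_cp(buf: bytes, i: int) -> int:
    """Byte index of the codepoint preceding the one starting at i."""
    i -= 1
    while i > 0 and (buf[i] & 0xC0) == 0x80:  # skip continuation bytes
        i -= 1
    return i
```

So a forward or backward scan costs one byte test per byte traversed, which is why short moves are "plenty fast".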
- (Fixed size caches) N entries, one for the last codepoint, and
others at Codepoint_Length/N intervals. N could be tunable.
To achieve a space saving, the cache has to be quite small, and the bigger your integers, the smaller it gets. A naive implementation on a 64-bit machine would give you 16 bytes per cache entry. Using a non-native size would be a space win, but needs care in implementation. Initializing the cache is very expensive for small strings, so you need conditional and maybe lazy initialization (for large strings).
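To make the cost model concrete, here is a minimal one-entry position cache; the N-entry variant just keeps more (codepoint, byte) pairs. The class, the names, and the "walk from cache or from the start, whichever is nearer" heuristic are all illustrative, not anything MicroPython actually does:

```python
class CachedStr:
    """UTF-8 string with one cached (codepoint index, byte offset) pair."""

    def __init__(self, data: bytes):
        self.data = data
        self.cache = (0, 0)  # (codepoint index, byte offset)

    def byte_offset(self, index: int) -> int:
        ci, bi = self.cache
        if index < ci // 2:          # the start is closer than the cache
            ci, bi = 0, 0
        step = 1 if index >= ci else -1
        while ci != index:           # walk one codepoint at a time
            bi += step
            while (self.data[bi] & 0xC0) == 0x80:  # skip continuation bytes
                bi += step
            ci += step
        self.cache = (ci, bi)        # remember where we ended up
        return bi

    def __getitem__(self, index: int) -> str:
        start = self.byte_offset(index)
        end = start + 1
        while end < len(self.data) and (self.data[end] & 0xC0) == 0x80:
            end += 1
        return self.data[start:end].decode("utf-8")
```

Note the 16 bytes/entry figure above: a naive cache entry is just two native-word integers, exactly like the tuple here.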
By the way, there's also
10) Keep counts of the leading and trailing number of ASCII (one-octet) characters. This is often a *huge* win; it's quite common to encounter documents where size - lc - tc = 2 (i.e., there's only one two-octet character in the document).
11) Keep a list (or tree) of most-recently-accessed positions.
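Trick 10 is easy to see in code: any index that falls inside the leading or trailing ASCII run maps to a byte offset in O(1), and only the multibyte middle needs a scan. A hypothetical sketch (names are mine):

```python
def ascii_runs(data: bytes):
    """Length of the leading (lc) and trailing (tc) one-octet runs."""
    n = len(data)
    lc = 0
    while lc < n and data[lc] < 0x80:
        lc += 1
    tc = 0
    while tc < n - lc and data[n - 1 - tc] < 0x80:
        tc += 1
    return lc, tc

def byte_index(data: bytes, nchars: int, lc: int, tc: int, i: int) -> int:
    """O(1) byte offset for codepoint i when it falls in an ASCII run."""
    if i < lc:
        return i                         # inside the leading run
    if i >= nchars - tc:
        return len(data) - (nchars - i)  # inside the trailing run
    raise NotImplementedError("linear scan needed for the multibyte middle")
```

In the common size - lc - tc = 2 case, *every* index but one is O(1).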
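And trick 11 might look like the following minimal sketch: a small bounded list of (codepoint index, byte offset) pairs, where a lookup seeds the scan from the cached position nearest the target and the result gets promoted to the front. Class and method names are invented for illustration:

```python
from collections import deque

class MRUPositions:
    """Bounded most-recently-used list of (codepoint index, byte offset)."""

    def __init__(self, maxlen: int = 4):
        self.entries = deque([(0, 0)], maxlen=maxlen)

    def nearest(self, index: int):
        """Cached pair closest to the target codepoint index."""
        return min(self.entries, key=lambda e: abs(e[0] - index))

    def remember(self, index: int, offset: int):
        """Promote (index, offset) to the front; oldest entry falls off."""
        entry = (index, offset)
        if entry in self.entries:
            self.entries.remove(entry)
        self.entries.appendleft(entry)
```

The deque's maxlen gives the eviction for free; a tree would trade that simplicity for better nearest-neighbor lookups on long strings.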
Despite my negative experience with multibyte encodings in Emacsen, I'm persuaded by the arguments that there probably aren't all that many places in core Python where indexing is used in an essential way, so MicroPython itself can probably optimize those "behind the scenes". Application programmers in the embedded context may be expected to deal with the need to avoid random-access algorithms and to use iterators and generators to accomplish most tasks.