On 6/4/2014 6:14 AM, Steve Dower wrote:
I'm agree with Daniel. Directly indexing into text suggests an attempted optimization that is likely to be incorrect for a set of strings. Splitting, regex, concatenation and formatting are really the main operations that matter, and MicroPython can optimize their implementation of these easily enough for O(N) indexing.

Cheers,
Steve

Top-posted from my Windows Phone

From: Daniel Holth
Sent: ‎6/‎4/‎2014 5:17
To: Paul Sokolovsky
Cc: python-dev
Subject: Re: [Python-Dev] Internal representation of strings and Micropython

If we're voting I think representing Unicode internally in micropython
as utf-8 with O(N) indexing is a great idea, partly because I'm not
sure indexing into strings is a good idea - lots of Unicode code
points don't make sense by themselves; see also grapheme clusters. It
would probably work great.

I think native UTF-8 support is the most promising route for a micropython Unicode support.

It would be an interesting proof-of-concept to implement an alternative CPython with PEP-393 replaced by UTF-8 internally... doing conversions for APIs that require a different encoding, but always maintaining and computing with the UTF-8 representation.

1) The first proof-of-concept implementation should implement codepoint indexing as a O(N) operation, searching from the beginning of the string for the Nth codepoint.

Other Proof-of-concept implementation could implement a codepoint boundary cache, there could be a variety of caching algorithms.

2) (Least space efficient) An array that could be indexed by codepoint position and result in byte position. (This would use more space than a UTF-32 representation!)

3) (Most space efficient) One cached entry, that caches the last codepoint/byte position referenced. UTF-8 is able to be traversed in either direction, so "next/previous" codepoint access would be relatively fast (and such are very common operations, even when indexing notation is used: "for ix in range( len( str_x )): func( str_x[ ix ])".)

4) (Fixed size caches)  N entries, one for the last codepoint, and others at Codepoint_Length/N intervals.  N could be tunable.

5) (Fixed size caches)  Like 4, plus an extra entry like 3.

6) (Variable size caches)  Like 2, but only indexing every  Nth code point.  N could be tunable.

7) (Variable size caches)  Like 6, plus an extra entry like 3.

8) (Content specific variable size caches)  Index each codepoint that is a different byte size than the previous codepoint, allowing indexing to be used in the intervals. Worst case size is like 2, best case size is a single entry for the end, when all code points are represented by the same number of bytes.

9) (Content specific variable size caches)  Like 8, only cache entries could indicate fixed or variable size characters in the next interval, with a scheme like 4 or 6 used to prevent one interval from covering the whole string.

Other hybrid schemes may present themselves as useful once experience is gained with some of these. It might be surprising how few algorithms need more than algorithm 3 to get reasonable performance.

Glenn