On 6/4/2014 6:14 AM, Steve Dower wrote:
> I agree with Daniel. Directly indexing into text suggests an
> attempted optimization that is likely to be incorrect for a
> set of strings. Splitting, regex, concatenation and
> formatting are really the main operations that matter, and
> MicroPython can optimize its implementations of these
> easily enough, even with O(N) indexing.
>> If we're voting I think representing Unicode internally in
>> micropython as utf-8 with O(N) indexing is a great idea, partly
>> because I'm not sure indexing into strings is a good idea - lots
>> of Unicode code points don't make sense by themselves; see also
>> grapheme clusters. It would probably work great.
I think native UTF-8 support is the most promising route for
micropython Unicode support.
It would be an interesting proof of concept to implement an
alternative CPython with PEP 393 replaced by UTF-8 internally...
doing conversions for APIs that require a different encoding, but
always maintaining and computing with the UTF-8 representation.
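
To make the indexing problem concrete: in UTF-8 the byte offset of a
codepoint generally differs from its codepoint index, so str_x[i]
cannot be a plain array lookup. A quick illustration in today's
CPython (nothing micropython-specific about it):

    s = "naïve"                  # 5 codepoints
    raw = s.encode("utf-8")      # b'na\xc3\xafve' - 6 bytes, 'ï' takes two
    assert len(s) == 5 and len(raw) == 6
    # codepoint index 3 ('v') lives at byte offset 4, not 3
    assert s[3] == "v" and raw[4:5] == b"v"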
1) The first proof-of-concept implementation should implement
codepoint indexing as an O(N) operation, searching from the beginning
of the string for the Nth codepoint (sketched just below).
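
A minimal sketch of that O(N) scan, in Python over a bytes object for
clarity (a real implementation would be C inside the string object;
utf8_index is just an illustrative name). UTF-8 makes the scan simple
because continuation bytes always look like 0b10xxxxxx, so counting
codepoints means counting the bytes that are not continuation bytes:

    def utf8_index(data, n):
        """Return the byte offset of codepoint n in UTF-8 bytes data. O(N)."""
        count = 0
        for pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:   # not a continuation byte: a codepoint starts here
                if count == n:
                    return pos
                count += 1
        raise IndexError("codepoint index out of range")

    raw = "naïve".encode("utf-8")
    assert utf8_index(raw, 3) == 4    # 'v' is codepoint 3, at byte offset 4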
Other proof-of-concept implementations could add a codepoint boundary
cache; there are a variety of possible caching algorithms:
2) (Least space efficient) An array indexed by codepoint position,
yielding the byte position. (This would use more space than a UTF-32
representation!)
3) (Most space efficient) One cached entry that records the last
codepoint/byte position referenced. UTF-8 can be traversed in either
direction, so "next/previous" codepoint access would be relatively
fast - and those are very common operations, even when indexing
notation is used: "for ix in range(len(str_x)): func(str_x[ix])".
(A sketch of this scheme appears after the list.)
4) (Fixed size caches) N entries, one for the last codepoint, and
others at Codepoint_Length/N intervals. N could be tunable.
5) (Fixed size caches) Like 4, plus an extra entry like 3.
6) (Variable size caches) Like 2, but only indexing every Nth
codepoint. N could be tunable. (Also sketched after the list.)
7) (Variable size caches) Like 6, plus an extra entry like 3.
8) (Content specific variable size caches) Index each codepoint that
is a different byte size than the previous codepoint, so that plain
arithmetic indexing works within each interval. Worst case size is
like 2; best case is a single entry for the end of the string, when
all codepoints are represented by the same number of bytes.
9) (Content specific variable size caches) Like 8, only cache
entries could indicate fixed or variable size characters in the next
interval, with a scheme like 4 or 6 used to prevent one interval
from covering the whole string.
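
Here is a minimal Python sketch of scheme 3, the single-entry cache.
The class and method names are made up for illustration, and the real
thing would live in the C string object; it relies only on the fact
noted above that UTF-8 can be walked in either direction, because
lead bytes and continuation bytes are distinguishable:

    class Utf8Str3:
        """Scheme 3: cache the last (codepoint index, byte offset) used."""
        def __init__(self, data):
            self.data = data      # UTF-8 bytes; callers pass valid indexes
            self.cache = (0, 0)   # (codepoint index, byte offset)

        def byte_offset(self, n):
            idx, pos = self.cache
            data = self.data
            while idx < n:        # walk forward one codepoint at a time
                pos += 1
                while pos < len(data) and data[pos] & 0xC0 == 0x80:
                    pos += 1
                idx += 1
            while idx > n:        # or walk backward, skipping continuation bytes
                pos -= 1
                while data[pos] & 0xC0 == 0x80:
                    pos -= 1
                idx -= 1
            self.cache = (idx, pos)
            return pos

For the sequential pattern quoted under 3), each lookup moves the
cache by exactly one codepoint, so the whole loop costs O(N) rather
than the O(N**2) of repeated scans from the start.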
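
And a sketch of scheme 6, the every-Nth-codepoint offset table (again
with hypothetical names; k plays the role of the tunable N). Note
that k=1 degenerates into scheme 2's full array, so larger k is
purely a space/time trade:

    class Utf8Str6:
        """Scheme 6: precompute the byte offset of every k-th codepoint."""
        def __init__(self, data, k=16):
            self.data, self.k = data, k
            self.offsets = [0]    # offsets[i] = byte offset of codepoint i*k
            count = 0
            for pos, byte in enumerate(data):
                if byte & 0xC0 != 0x80:      # lead byte of codepoint number `count`
                    if count and count % k == 0:
                        self.offsets.append(pos)
                    count += 1

        def byte_offset(self, n):
            pos = self.offsets[n // self.k]
            for _ in range(n % self.k):      # scan at most k-1 codepoints
                pos += 1
                while self.data[pos] & 0xC0 == 0x80:
                    pos += 1
            return pos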
Other hybrid schemes may present themselves as useful once experience
is gained with some of these. It might be surprising how few
algorithms need more than algorithm 3 to get reasonable performance.