[Python-Dev] Internal representation of strings and Micropython

Glenn Linderman v+python at g.nevcal.com
Wed Jun 4 22:50:42 CEST 2014


On 6/4/2014 6:14 AM, Steve Dower wrote:
> I agree with Daniel. Directly indexing into text suggests an 
> attempted optimization that is likely to be incorrect for a set of 
> strings. Splitting, regex, concatenation and formatting are really the 
> main operations that matter, and MicroPython can optimize its 
> implementation of these easily enough even with O(N) indexing.
>
> Cheers,
> Steve
>
> Top-posted from my Windows Phone
> ------------------------------------------------------------------------
> From: Daniel Holth <mailto:dholth at gmail.com>
> Sent: ‎6/‎4/‎2014 5:17
> To: Paul Sokolovsky <mailto:pmiscml at gmail.com>
> Cc: python-dev <mailto:python-dev at python.org>
> Subject: Re: [Python-Dev] Internal representation of strings and 
> Micropython
>
> If we're voting, I think representing Unicode internally in MicroPython
> as UTF-8 with O(N) indexing is a great idea, partly because I'm not
> sure indexing into strings is a good idea: lots of Unicode code
> points don't make sense by themselves; see also grapheme clusters. It
> would probably work great.

I think native UTF-8 support is the most promising route to MicroPython 
Unicode support.

It would be an interesting proof of concept to implement an alternative 
CPython with PEP 393 replaced by UTF-8 internally, doing conversions 
only for APIs that require a different encoding, but always maintaining 
and computing with the UTF-8 representation.

1) The first proof-of-concept implementation should implement codepoint 
indexing as an O(N) operation, searching from the beginning of the 
string for the Nth codepoint.
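
For concreteness, here is a minimal Python sketch of that O(N) scan over 
a UTF-8 buffer; the function name and the use of a bytes object are 
illustrative only, since a real implementation would live in the 
interpreter's C code:

    def utf8_index(buf, n):
        """Return the n-th codepoint of UTF-8 bytes, by linear scan."""
        count = -1                    # codepoints seen so far
        start = None                  # byte offset where codepoint n starts
        for pos, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:   # not a continuation byte: a codepoint starts here
                count += 1
                if count == n:
                    start = pos
                elif count == n + 1:
                    return buf[start:pos].decode('utf-8')
        if count >= n and start is not None:
            return buf[start:].decode('utf-8')
        raise IndexError('string index out of range')

    assert utf8_index('héllo'.encode('utf-8'), 1) == 'é'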

Later proof-of-concept implementations could add a codepoint boundary 
cache; a variety of caching algorithms are possible:

2) (Least space efficient) An array indexed by codepoint position that 
yields the corresponding byte position. (This would use more space than 
a UTF-32 representation!)
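
A sketch of scheme 2, again in illustrative Python: record the byte 
offset of every codepoint up front, making indexing O(1) at the cost of 
one full-width offset per codepoint (hence potentially bigger than 
UTF-32):

    def build_offset_table(buf):
        """Byte offset of every codepoint in a UTF-8 buffer."""
        return [pos for pos, byte in enumerate(buf) if byte & 0xC0 != 0x80]

    buf = 'héllo'.encode('utf-8')
    offsets = build_offset_table(buf)    # [0, 1, 3, 4, 5]
    # Codepoint i occupies buf[offsets[i]:offsets[i+1]] (or to the end, for the last).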

3) (Most space efficient) A single cached entry recording the last 
codepoint/byte position referenced. UTF-8 can be traversed in either 
direction, so "next/previous" codepoint access would be relatively fast, 
and such accesses are very common, even when indexing notation is used: 
"for ix in range(len(str_x)): func(str_x[ix])".
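
A sketch of scheme 3: keep one (codepoint index, byte offset) pair and 
step it forward or backward on each access, skipping continuation bytes 
(those of the form 10xxxxxx) in either direction. The class and method 
names are illustrative:

    class CachedUTF8:
        def __init__(self, buf):
            self.buf = buf
            self.cp = 0    # codepoint index of the cached position
            self.pos = 0   # byte offset of the cached position

        def byte_offset(self, n):
            """Byte offset of codepoint n, stepping from the cached position."""
            step = 1 if n > self.cp else -1
            while self.cp != n:
                self.pos += step
                while 0 < self.pos < len(self.buf) and \
                        self.buf[self.pos] & 0xC0 == 0x80:
                    self.pos += step    # skip continuation bytes
                self.cp += step
            return self.pos

    s = CachedUTF8('héllo'.encode('utf-8'))
    assert [s.byte_offset(i) for i in range(5)] == [0, 1, 3, 4, 5]

A sequential loop like the one quoted above then hits the cache on every 
access, making each index O(1) amortized.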

4) (Fixed-size caches) N entries: one for the last codepoint of the 
string, and the others at Codepoint_Length/N intervals. N could be 
tunable.

5) (Fixed-size caches) Like 4, plus an extra entry like 3.

6) (Variable-size caches) Like 2, but only indexing every Nth codepoint. 
N could be tunable.
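
A sketch of scheme 6, with an illustrative N: store the byte offset of 
every Nth codepoint, so any index costs one table lookup plus a scan of 
at most N-1 codepoints:

    N = 16    # tunable: larger N trades lookup speed for table size

    def build_sparse_table(buf, n=N):
        """Byte offsets of codepoints 0, n, 2n, ... in a UTF-8 buffer."""
        table, cp = [], 0
        for pos, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:
                if cp % n == 0:
                    table.append(pos)
                cp += 1
        return table

    def byte_offset(buf, table, i, n=N):
        """Byte offset of codepoint i: jump to table[i // n], scan i % n more."""
        pos, remaining = table[i // n], i % n
        while remaining:
            pos += 1
            if buf[pos] & 0xC0 != 0x80:
                remaining -= 1
        return pos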

7) (Variable-size caches) Like 6, plus an extra entry like 3.

8) (Content-specific variable-size caches) Index each codepoint whose 
encoded byte size differs from the previous codepoint's, so that 
positions within each uniform-width run can be computed arithmetically. 
Worst-case size is like 2; best case is a single entry for the end, when 
all codepoints are represented by the same number of bytes.
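
A sketch of scheme 8: record an entry (codepoint index, byte offset, 
width) only where the encoded width changes; inside each uniform-width 
run the byte offset is pure arithmetic. Helper names are illustrative:

    import bisect

    def utf8_width(lead):
        """Encoded length in bytes of a codepoint, from its leading byte."""
        return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4

    def build_run_table(buf):
        """One (codepoint index, byte offset, width) entry per uniform-width run."""
        runs, cp, pos, prev = [], 0, 0, None
        while pos < len(buf):
            w = utf8_width(buf[pos])
            if w != prev:
                runs.append((cp, pos, w))
                prev = w
            cp += 1
            pos += w
        return runs

    def byte_offset(runs, i):
        """Byte offset of codepoint i: find its run, then use arithmetic."""
        k = bisect.bisect_right([r[0] for r in runs], i) - 1
        run_cp, run_pos, width = runs[k]
        return run_pos + (i - run_cp) * width

    runs = build_run_table('héllo'.encode('utf-8'))  # [(0, 0, 1), (1, 1, 2), (2, 3, 1)]
    assert byte_offset(runs, 3) == 4

An all-ASCII string then needs almost no cache, while pathological 
alternating-width content degenerates to scheme 2's one entry per 
codepoint.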

9) (Content-specific variable-size caches) Like 8, but cache entries 
could indicate whether the next interval contains fixed- or 
variable-size characters, with a scheme like 4 or 6 used to prevent one 
interval from covering the whole string.

Other hybrid schemes may prove useful once experience is gained with 
some of these. It might be surprising how little code needs more than 
scheme 3 to get reasonable performance.

Glenn