[Python-Dev] Internal representation of strings and Micropython
Glenn Linderman
v+python at g.nevcal.com
Wed Jun 4 22:50:42 CEST 2014
On 6/4/2014 6:14 AM, Steve Dower wrote:
> I agree with Daniel. Directly indexing into text suggests an
> attempted optimization that is likely to be incorrect for a set of
> strings. Splitting, regex, concatenation and formatting are really the
> main operations that matter, and MicroPython can optimize their
> implementation of these easily enough for O(N) indexing.
>
> Cheers,
> Steve
>
> Top-posted from my Windows Phone
> ------------------------------------------------------------------------
> From: Daniel Holth <mailto:dholth at gmail.com>
> Sent: 6/4/2014 5:17
> To: Paul Sokolovsky <mailto:pmiscml at gmail.com>
> Cc: python-dev <mailto:python-dev at python.org>
> Subject: Re: [Python-Dev] Internal representation of strings and
> Micropython
>
> If we're voting I think representing Unicode internally in micropython
> as utf-8 with O(N) indexing is a great idea, partly because I'm not
> sure indexing into strings is a good idea - lots of Unicode code
> points don't make sense by themselves; see also grapheme clusters. It
> would probably work great.
I think native UTF-8 support is the most promising route for
MicroPython Unicode support.
It would be an interesting proof-of-concept to implement an alternative
CPython with PEP-393 replaced by UTF-8 internally... doing conversions
for APIs that require a different encoding, but always maintaining and
computing with the UTF-8 representation.
1) The first proof-of-concept implementation should implement codepoint
indexing as a O(N) operation, searching from the beginning of the string
for the Nth codepoint.
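To make the O(N) search concrete, here is a minimal sketch (my own
illustration, not MicroPython code): UTF-8 continuation bytes all match
the bit pattern 10xxxxxx, so counting lead bytes from the start of the
buffer finds the Nth codepoint.

```python
def codepoint_index(buf: bytes, n: int) -> int:
    """Return the byte offset of the n-th codepoint in UTF-8 buffer buf.

    Walks from the start, counting lead bytes (anything that is not a
    continuation byte 0b10xxxxxx), so the cost is O(n).
    """
    count = 0
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:  # lead byte: starts a new codepoint
            if count == n:
                return pos
            count += 1
    raise IndexError("codepoint index out of range")
```

For example, in "héllo" encoded as UTF-8, "é" occupies two bytes, so the
codepoint at index 2 ("l") lives at byte offset 3.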
Other proof-of-concept implementations could add a codepoint boundary
cache; a variety of caching algorithms is possible:
2) (Least space efficient) An array that could be indexed by codepoint
position and result in byte position. (This would use more space than a
UTF-32 representation!)
3) (Most space efficient) One cached entry, which caches the last
codepoint/byte position referenced. UTF-8 can be traversed in either
direction, so "next/previous" codepoint access would be relatively fast
(and such accesses are very common, even when indexing notation is
used: "for ix in range(len(str_x)): func(str_x[ix])").
4) (Fixed size caches) N entries, one for the last codepoint, and
others at Codepoint_Length/N intervals. N could be tunable.
5) (Fixed size caches) Like 4, plus an extra entry like 3.
6) (Variable size caches) Like 2, but only indexing every Nth code
point. N could be tunable.
7) (Variable size caches) Like 6, plus an extra entry like 3.
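Scheme 6 might look something like the following sketch (illustrative
only): store the byte offset of every stride-th codepoint, then jump to
the nearest stored offset and scan forward at most stride-1 codepoints,
trading O(length/stride) extra space for O(stride) worst-case indexing.

```python
def build_stride_index(buf: bytes, stride: int):
    """Record the byte offset of every stride-th codepoint in buf."""
    offsets = []
    count = 0
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:  # lead byte
            if count % stride == 0:
                offsets.append(pos)
            count += 1
    return offsets

def stride_lookup(buf: bytes, offsets, stride: int, n: int) -> int:
    """Byte offset of codepoint n, using the stride index."""
    pos = offsets[n // stride]
    remaining = n % stride
    while remaining:
        pos += 1
        if buf[pos] & 0xC0 != 0x80:  # crossed into the next codepoint
            remaining -= 1
    return pos
```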
8) (Content specific variable size caches) Index each codepoint that is
a different byte size than the previous codepoint, allowing indexing to
be used in the intervals. Worst case size is like 2, best case size is a
single entry for the end, when all code points are represented by the
same number of bytes.
9) (Content specific variable size caches) Like 8, only cache entries
could indicate fixed or variable size characters in the next interval,
with a scheme like 4 or 6 used to prevent one interval from covering the
whole string.
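Scheme 8 could be sketched as follows (again a hypothetical
illustration): record an entry only where the encoded width of a
codepoint changes; within a run of same-width codepoints, the byte
offset is then computed directly from the run's start.

```python
import bisect

def char_width(lead: int) -> int:
    """Encoded byte length of a codepoint, from its UTF-8 lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

def build_transition_index(buf: bytes):
    """Entries of (codepoint index, byte offset, width) at width changes."""
    entries = []
    cp = pos = 0
    prev_width = None
    while pos < len(buf):
        w = char_width(buf[pos])
        if w != prev_width:
            entries.append((cp, pos, w))
            prev_width = w
        cp += 1
        pos += w
    return entries

def transition_lookup(entries, n: int) -> int:
    """Byte offset of codepoint n: binary-search the covering run."""
    i = bisect.bisect_right([e[0] for e in entries], n) - 1
    cp, pos, w = entries[i]
    return pos + (n - cp) * w
```

For an all-ASCII (or all-same-width) string this collapses to a single
entry, matching the best case described above.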
Other hybrid schemes may present themselves as useful once experience is
gained with some of these. It might be surprising how few workloads need
anything more than algorithm 3 to get reasonable performance.
Glenn