[Python-Dev] Internal representation of strings and Micropython

Wed Jun 4 09:02:13 CEST 2014

Zitat von Steven D'Aprano <steve at pearwood.info>:

> * Having a build-time option to restrict all strings to ASCII-only.
>
> (I think what they mean by that is that strings will be like Python 2
> strings, ASCII-plus-arbitrary-bytes, not actually ASCII.)

An ASCII-plus-arbitrary-bytes type called "str" would prevent claiming
"Python 3.4 compatibility" for sure.

Restricting strings to ASCII (as Chris apparently actually suggested)
would allow to claim compatibility with a stretch: existing Python
code might not run on such an implementation. However, since a lot
of existing Python code wouldn't run on MicroPython, anyway, one
might claim to implement a Python 3.4 subset.

> * Implementing Unicode internally as UTF-8, and giving up O(1)
> indexing operations.
>
> Would either of these trade-offs be acceptable while still claiming
> "Python 3.4 compatibility"?
>
> My own feeling is that O(1) string indexing operations are a quality of
> implementation issue, not a deal breaker to call it a Python. I can't
> see any requirement in the docs that str[n] must take O(1) time, but
> perhaps I have missed something.

I agree. It's an open question whether such an implementation would be
practical, both in terms of existing Python code, and in terms of existing
C extension modules that people might want to port to MicroPython.

There are more things to consider for the internal implementation,
in particular how the string length is implemented. Several alternatives
exist:
1. store the UTF-8 length (i.e. memory size)
2. store the number of code points (i.e. Python len())
3. store both
4. store neither, but use null termination instead

Variant 3 is most run-time efficient, but could easily use 8 bytes
just for the length, which could outweigh the storage of the actual
data. Variants 1 and 2 lose on some operations (1 loses on computing
len(), 2 loses on string concatenation). 3 would add the restriction
of not allowing U+0000 in a string (which would be reasonable IMO),
and make all length computations inefficient. However, it wouldn't
be worse than standard C.

Regards,
Martin