[Python-Dev] Internal representation of strings and Micropython

Thu Jun 5 15:23:12 CEST 2014

On Wed, Jun 04, 2014 at 11:17:18AM +1000, Steven D'Aprano wrote:
> There is a discussion over at MicroPython about the internal 
> representation of Unicode strings. Micropython is aimed at embedded 
> devices, and so minimizing memory use is important, possibly even 
> more important than performance.
[...]

Wow! I'm amazed at the response here, since I expected it would have 
received a fairly brief "Yes" or "No" response, not this long thread. 
Here is a summary (as best as I am able) of a few points which I think 
are important:

(1) I asked if it would be okay for MicroPython to *optionally* use 
nominally Unicode strings limited to ASCII. Pretty much the only 
response to this has been Guido saying "That would be a pretty lousy 
option", and since nobody has really defended the suggestion, I think we 
can assume that it's off the table.

(2) I asked if it would be okay for µPy to use a UTF-8 implementation 
even though it would lead to O(N) indexing operations instead of O(1). 
There's been some opposition to this, including Guido's:

    Then again the UTF-8 option would be pretty devastating 
    too for anything manipulating strings (especially since 
    many Python APIs are defined using indexes, e.g. the re 
    module).

but unless Guido wants to say different, I think the consensus is that 
a UTF-8 implementation is allowed, even at the cost of O(N) indexing 
operations. Trading time for memory is allowed, assuming that UTF-8 
actually does save memory, which I think is an assumption and not proven.
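
To make the O(N) cost concrete, here is a minimal sketch of indexing by 
code point into a UTF-8 buffer. The function name utf8_index is made up 
for illustration and this is not how MicroPython (or any other 
implementation) necessarily does it; it only shows why finding the i-th 
character means scanning from the start of the buffer:

```python
def utf8_index(buf: bytes, i: int) -> str:
    """Return the i-th code point of the UTF-8 encoded `buf`.

    Hypothetical helper for illustration only. It assumes `buf` is valid
    UTF-8 and does not handle negative indexes.
    """
    if i < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1      # index of the code point we are currently inside
    start = None    # byte offset where code point i begins, once found
    for pos, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                # Reached the start of the next code point, so the
                # one we want ends just before this byte.
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    # Code point i was the last one in the buffer.
    return buf[start:].decode("utf-8")


text = "aµ€z"
encoded = text.encode("utf-8")
# Each lookup walks the buffer from byte 0, so cost grows with i.
chars = [utf8_index(encoded, i) for i in range(len(text))]
```

With a fixed-width representation the same lookup is a single offset 
calculation, which is the O(1) behaviour being given up.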

(3) It seems to me that there's been a lot of theorizing about what 
implementation will be obviously more efficient. Folks, how about some 
benchmarks before making claims about code efficiency? :-)

(4) Similarly, there have been many suggestions more suited in my 
opinion to python-ideas, or even python-list, for ways to implement O(1) 
indexing on top of UTF-8. Some of them involve per-string mutable state 
(e.g. the last index seen), or complicated int sub-classes that need to 
know what string they come from. Remember your Zen please:

    Simple is better than complex.
    Complex is better than complicated.
    ...
    If the implementation is hard to explain, it's a bad idea.

(5) I'm not convinced that UTF-8 internally is *necessarily* more 
efficient, but look forward to seeing the result of benchmarks. The 
rationale of internal UTF-8 is that the use of any other encoding 
internally will be inefficient since those strings will need to be 
transcoded to UTF-8 before they can be written or printed, so keeping 
them as UTF-8 in the first place saves the transcoding step. Well, yes, 
but many strings may never be written out:

    print(prefix + s[1:].strip().lower().center(80) + suffix)

creates five strings that are never written out and one that is. So if 
the internal encoding of strings is more efficient than UTF-8, and most 
of them never need transcoding to UTF-8, a non-UTF-8 internal format 
might be a nett win. So I'm looking forward to seeing the results of 
µPy's experiments with it.
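
For anyone who wants to check the count, the one-liner above can be 
unrolled step by step. The sample values for prefix, s and suffix are 
made up, and an implementation is free to reuse an existing object when 
a method has nothing to change, so treat this as a sketch of the general 
case rather than a guarantee:

```python
# Made-up sample values; any strings would do.
prefix, s, suffix = "[", "x  Hello, µPy  ", "]"

a = s[1:]         # 1: slice, never written out
b = a.strip()     # 2: never written out
c = b.lower()     # 3: never written out
d = c.center(80)  # 4: never written out
e = prefix + d    # 5: never written out
f = e + suffix    # 6: the only string print() would actually write

intermediates = [a, b, c, d, e]

# With a non-UTF-8 internal format, only the final string pays the
# transcoding cost when it is written to a UTF-8 stream.
written = f.encode("utf-8")
```

Whether that single transcoding step costs more or less than doing five 
string operations on UTF-8 data is exactly what a benchmark would have 
to show.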

Thanks to all who have commented.

-- 
Steven