[Python-Dev] Internal representation of strings and Micropython
Steven D'Aprano
steve at pearwood.info
Thu Jun 5 15:23:12 CEST 2014
On Wed, Jun 04, 2014 at 11:17:18AM +1000, Steven D'Aprano wrote:
> There is a discussion over at MicroPython about the internal
> representation of Unicode strings. Micropython is aimed at embedded
> devices, and so minimizing memory use is important, possibly even
> more important than performance.
[...]
Wow! I'm amazed at the response here, since I expected it would have
received a fairly brief "Yes" or "No" response, not this long thread.
Here is a summary (as best as I am able) of a few points which I think
are important:
(1) I asked if it would be okay for MicroPython to *optionally* use
nominally Unicode strings limited to ASCII. Pretty much the only
response to this has been Guido saying "That would be a pretty lousy
option", and since nobody has really defended the suggestion, I think we
can assume that it's off the table.
(2) I asked if it would be okay for µPy to use a UTF-8 implementation
even though it would lead to O(N) indexing operations instead of O(1).
There's been some opposition to this, including Guido's:
    Then again the UTF-8 option would be pretty devastating
    too for anything manipulating strings (especially since
    many Python APIs are defined using indexes, e.g. the re
    module).
but unless Guido wants to say different, I think the consensus is that
a UTF-8 implementation is allowed, even at the cost of O(N) indexing
operations. Trading time for memory -- assuming that it does save memory,
which I think is an assumption and not proven -- is allowed.
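To make the O(N) cost concrete, here is a minimal sketch of indexing by code point over a UTF-8 byte buffer. The function name utf8_index is made up for illustration, and this is not how MicroPython (or any other implementation) necessarily does it; it only shows why the scan has to start from the beginning of the buffer.

```python
def utf8_index(buf: bytes, i: int) -> str:
    """Return the i-th code point of UTF-8 encoded buf (hypothetical helper).

    Code points have variable width in UTF-8, so the byte offset of the
    i-th one is unknown until every byte before it has been inspected.
    """
    if i < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1      # index of the code point most recently started
    start = None    # byte offset where code point i begins
    for pos, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    return buf[start:].decode("utf-8")   # code point i was the last one


encoded = "aµ€𝄞z".encode("utf-8")   # 1-, 2-, 3-, 4- and 1-byte characters
print([utf8_index(encoded, n) for n in range(5)])
```

The work done grows with the index requested, which is the cost being weighed against any memory saving.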
(3) It seems to me that there's been a lot of theorizing about what
implementation will be obviously more efficient. Folks, how about some
benchmarks before making claims about code efficiency? :-)
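As a starting point for such a benchmark, here is a rough sketch using the standard timeit module. The sample text, the function names and the repeat counts are all arbitrary choices of mine, and numbers measured on CPython say little about an embedded target, so treat it as the shape of a measurement and not as evidence either way.

```python
import timeit

SAMPLE = "abc µ€ " * 200            # hypothetical test data, mostly ASCII
ENCODED = SAMPLE.encode("utf-8")
TARGET = len(SAMPLE) - 1            # last character: worst case for a scan


def index_native():
    """Index the built-in str directly."""
    return SAMPLE[TARGET]


def index_utf8_scan():
    """Find the same character by walking the UTF-8 bytes from the start."""
    seen = -1
    for pos, byte in enumerate(ENCODED):
        if byte & 0xC0 != 0x80:     # not a continuation byte
            seen += 1
            if seen == TARGET:
                end = pos + 1
                while end < len(ENCODED) and ENCODED[end] & 0xC0 == 0x80:
                    end += 1
                return ENCODED[pos:end].decode("utf-8")
    raise IndexError(TARGET)


def measure(func, number=200):
    """Best-of-three average time per call, in seconds."""
    return min(timeit.repeat(func, number=number, repeat=3)) / number


native_time = measure(index_native)
scan_time = measure(index_utf8_scan)
print("native index: %.3g s per call" % native_time)
print("UTF-8 scan:   %.3g s per call" % scan_time)
```

A useful benchmark would also vary the string length, the mix of ASCII and non-ASCII text, and the access pattern, and would measure memory as well as time.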
(4) Similarly, there have been many suggestions more suited in my
opinion to python-ideas, or even python-list, for ways to implement O(1)
indexing on top of UTF-8. Some of them involve per-string mutable state
(e.g. the last index seen), or complicated int sub-classes that need to
know what string they come from. Remember your Zen please:
    Simple is better than complex.
    Complex is better than complicated.
    ...
    If the implementation is hard to explain, it's a bad idea.
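For anyone unsure what "per-string mutable state" would look like, here is a toy sketch of the "last index seen" idea. The class CachedUTF8 and its attributes are invented for this illustration and are not taken from any of the proposals in the thread.

```python
class CachedUTF8:
    """Toy UTF-8 string that remembers the last position looked up."""

    def __init__(self, text):
        self.buf = text.encode("utf-8")
        # Hidden mutable state: (code point index, byte offset) of the
        # most recent successful lookup.
        self._last = (0, 0)

    def _next(self, pos):
        """Return the byte offset of the code point following the one at pos."""
        pos += 1
        while pos < len(self.buf) and self.buf[pos] & 0xC0 == 0x80:
            pos += 1
        return pos

    def __getitem__(self, i):
        if i < 0:
            raise IndexError("negative indexes are not handled in this sketch")
        idx, pos = self._last
        if i < idx:
            # The cache is already past the target, so start over.
            idx, pos = 0, 0
        while idx < i and pos < len(self.buf):
            pos = self._next(pos)
            idx += 1
        if pos >= len(self.buf):
            raise IndexError("string index out of range")
        self._last = (idx, pos)     # a read operation mutates the object
        return self.buf[pos:self._next(pos)].decode("utf-8")


s = CachedUTF8("aµ€𝄞z")
print([s[n] for n in range(5)])     # forward iteration resumes from the cache
print(s[1])                         # going backwards falls back to a rescan
```

Forward iteration gets cheap, but every lookup now writes to the string, backwards or random access is still O(N), and the behaviour is harder to explain than either plain alternative.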
(5) I'm not convinced that UTF-8 internally is *necessarily* more
efficient, but look forward to seeing the result of benchmarks. The
rationale of internal UTF-8 is that the use of any other encoding
internally will be inefficient since those strings will need to be
transcoded to UTF-8 before they can be written or printed, so keeping
them as UTF-8 in the first place saves the transcoding step. Well, yes,
but many strings may never be written out:
    print(prefix + s[1:].strip().lower().center(80) + suffix)
creates five strings that are never written out and one that is. So if
the internal encoding of strings is more efficient than UTF-8, and most
of them never need transcoding to UTF-8, a non-UTF-8 internal format
might be a nett win. So I'm looking forward to seeing the results of
µPy's experiments with it.
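The count of five intermediate strings can be checked by unrolling the expression one step at a time. The values of prefix, s and suffix below are made up for illustration, and an implementation is free to return the same object when a step changes nothing, so five is an upper bound on the number of new strings.

```python
prefix, s, suffix = "[", "x  Hello World  ", "]"    # hypothetical inputs

a = s[1:]            # 1: slice
b = a.strip()        # 2: whitespace removed
c = b.lower()        # 3: lower-cased
d = c.center(80)     # 4: padded to 80 columns
e = prefix + d       # 5: first concatenation
final = e + suffix   # the only string that is actually written out

intermediates = [a, b, c, d, e]
print(len(intermediates), "intermediate strings, 1 written")

# Only the final string would need transcoding on output.
print(len(final.encode("utf-8")), "bytes written")
```

With these inputs only the last string ever has to be in the output encoding; the other five exist only long enough to feed the next step.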
Thanks to all who have commented.
--
Steven