On Wed, Jun 04, 2014 at 11:17:18AM +1000, Steven D'Aprano wrote:
> There is a discussion over at MicroPython about the internal representation of Unicode strings. MicroPython is aimed at embedded devices, and so minimizing memory use is important, possibly even more important than performance.
Wow! I'm amazed at the response here, since I expected it would have received a fairly brief "Yes" or "No" response, not this long thread. Here is a summary (as best as I am able) of a few points which I think are important:
(1) I asked if it would be okay for MicroPython to *optionally* use nominally Unicode strings limited to ASCII. Pretty much the only response to this has been Guido saying "That would be a pretty lousy option", and since nobody has really defended the suggestion, I think we can assume that it's off the table.
(2) I asked if it would be okay for µPy to use a UTF-8 implementation even though it would lead to O(N) indexing operations instead of O(1). There's been some opposition to this, including Guido's:
> Then again the UTF-8 option would be pretty devastating too for anything manipulating strings (especially since many Python APIs are defined using indexes, e.g. the re module).
but unless Guido wants to say different, I think the consensus is that a UTF-8 implementation is allowed, even at the cost of O(N) indexing operations. Trading time for memory -- assuming that it does save memory, which I think is an assumption and not proven -- is allowed.
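To make the O(N) cost concrete, here is a minimal sketch of indexing by code point into a UTF-8 buffer. The names utf8_index and _seq_len are hypothetical, invented for illustration; this is not how µPy or any other implementation actually does it, and it assumes the buffer holds valid UTF-8.

```python
def _seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_index(buf: bytes, index: int) -> str:
    """Return the code point at position `index` of UTF-8 encoded `buf`.

    Hypothetical helper: it walks from the start of the buffer, so the
    cost grows with `index` (O(N)) rather than being constant.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    pos = 0  # byte offset of the current code point
    for _ in range(index):
        if pos >= len(buf):
            raise IndexError("string index out of range")
        pos += _seq_len(buf[pos])
    if pos >= len(buf):
        raise IndexError("string index out of range")
    return buf[pos:pos + _seq_len(buf[pos])].decode("utf-8")
```

For a pure-ASCII buffer every sequence is one byte long, so an implementation could special-case that and index directly; the loop is only needed once multi-byte sequences are present.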
(3) It seems to me that there's been a lot of theorizing about what implementation will be obviously more efficient. Folks, how about some benchmarks before making claims about code efficiency? :-)
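As a small example of measuring instead of guessing, the sketch below compares payload sizes only: UTF-8 against a fixed-width layout that uses the narrowest of 1, 2 or 4 bytes per code point. The function names are hypothetical, the fixed-width rule is an assumption chosen for comparison, and per-object overhead is ignored entirely, so treat it as a starting point and not as a benchmark of any real implementation.

```python
def utf8_size(text: str) -> int:
    """Payload bytes needed to store `text` as UTF-8."""
    return len(text.encode("utf-8"))


def fixed_width_size(text: str) -> int:
    """Payload bytes for a fixed-width layout (assumed for comparison).

    Every code point takes the same width, chosen as the narrowest of
    1, 2 or 4 bytes that fits the largest code point in the string.
    """
    if not text:
        return 0
    top = max(map(ord, text))
    width = 1 if top < 0x100 else 2 if top < 0x10000 else 4
    return width * len(text)


def compare(samples):
    """Return (text, utf8 bytes, fixed-width bytes) for each sample."""
    return [(t, utf8_size(t), fixed_width_size(t)) for t in samples]
```

Calling compare() on strings representative of the target workload shows which layout is smaller for that data; which one wins depends on the mix of code points.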
(4) Similarly, there have been many suggestions more suited in my opinion to python-ideas, or even python-list, for ways to implement O(1) indexing on top of UTF-8. Some of them involve per-string mutable state (e.g. the last index seen), or complicated int sub-classes that need to know what string they come from. Remember your Zen please:
    Simple is better than complex.
    Complex is better than complicated.
    ...
    If the implementation is hard to explain, it's a bad idea.
(5) I'm not convinced that UTF-8 internally is *necessarily* more efficient, but look forward to seeing the results of benchmarks. The rationale for internal UTF-8 is that the use of any other encoding internally will be inefficient since those strings will need to be transcoded to UTF-8 before they can be written or printed, so keeping them as UTF-8 in the first place saves the transcoding step. Well, yes, but many strings may never be written out:
    print(prefix + s[1:].strip().lower().center(80) + suffix)
creates five strings that are never written out and one that is. So if the internal encoding of strings is more efficient than UTF-8, and most of them never need transcoding to UTF-8, a non-UTF-8 internal format might be a nett win. So I'm looking forward to seeing the results of µPy's experiments with it.
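The count can be checked by spelling the expression out one step at a time. The sketch below uses a hypothetical helper, intermediates(), and made-up sample values; an implementation is free to avoid some of these copies (for example when a method returns its input unchanged), so the list shows the logical steps, not guaranteed allocations.

```python
def intermediates(prefix: str, s: str, suffix: str) -> list:
    """Evaluate prefix + s[1:].strip().lower().center(80) + suffix
    one operation at a time, collecting every string produced."""
    steps = []
    t = s[1:]
    steps.append(t)            # 1: slice
    t = t.strip()
    steps.append(t)            # 2: strip
    t = t.lower()
    steps.append(t)            # 3: lower
    t = t.center(80)
    steps.append(t)            # 4: center
    t = prefix + t
    steps.append(t)            # 5: first concatenation
    t = t + suffix
    steps.append(t)            # 6: the only string passed to print()
    return steps


# Sample values chosen for illustration only.
steps = intermediates("[", "x  Hello World  ", "]")
temporaries, written = steps[:-1], steps[-1]
```

Here temporaries holds the five strings that are never written out, and written is the single string that would need transcoding on output.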
Thanks to all who have commented.