Steven D'Aprano wrote:
(1) I asked if it would be okay for MicroPython to *optionally* use nominally Unicode strings limited to ASCII. Pretty much the only response to this has been Guido saying "That would be a pretty lousy option", and since nobody has really defended the suggestion, I think we can assume that it's off the table.
Lousy is not quite the same as forbidden.
Doing it in good faith would require making the limit prominent in the documentation, and raising some sort of CharacterNotSupported exception (or at least a warning) whenever there is an attempt to create a non-ASCII string, even via the C API.
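As a rough illustration of the kind of check I mean, here is a sketch in ordinary Python rather than in MicroPython's C internals. Both CharacterNotSupported and make_ascii_str are hypothetical names invented for this example; nothing like them exists in MicroPython as far as I know.

```python
class CharacterNotSupported(ValueError):
    """Hypothetical exception for an ASCII-only string build."""


def make_ascii_str(data: bytes) -> str:
    """Build a str from raw bytes, refusing anything outside ASCII.

    Hypothetical helper: it stands in for whatever single choke point
    creates strings, including strings created through the C API.
    """
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            raise CharacterNotSupported(
                f"non-ASCII byte 0x{byte:02x} at offset {offset}"
            )
    return data.decode("ascii")
```

The point is only that the failure is loud and happens at creation time, instead of producing a silently mangled string later.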
(2) I asked if it would be okay ... to use an UTF-8 implementation even though it would lead to O(N) indexing operations instead of O(1). There's been some opposition to this, including Guido's:
[Non-ASCII character removed.]
It is bad when quirks -- even good quirks -- of one implementation lead people to write code that will perform badly on a different Python implementation. CPython has at least delayed obvious optimizations for this reason. Changing idiomatic operations from O(1) to O(N) is big enough to be a concern.
That said, the target environment itself apparently limits N to values small enough that the problem should be mostly theoretical. If you want to be good citizens, then do put a note in the documentation warning that particularly long strings are likely to cause performance issues unique to the MicroPython implementation.
(Frankly, my personal opinion is that if you're really optimizing for space, then long strings will start getting awkward long before N is big enough for algorithmic complexity to overcome constant factors.)
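To make the O(N) cost concrete, here is a sketch of indexing by character position into a UTF-8 buffer. It is plain Python written for illustration, not MicroPython's actual code, and utf8_index is a hypothetical name. It assumes the buffer holds valid UTF-8 and ignores negative indexes.

```python
def utf8_index(buf: bytes, index: int) -> str:
    """Return the code point at position `index` in UTF-8 encoded `buf`.

    Characters vary in width, so the only way to find character `index`
    is to walk the bytes from the start: O(N) instead of O(1).
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    seen = 0        # code points passed so far
    start = None    # byte offset of the wanted code point, once found
    for pos, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; anything else
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if start is not None:
                # Reached the next code point, so the wanted one ends here.
                return buf[start:pos].decode("utf-8")
            if seen == index:
                start = pos
            seen += 1
    if start is not None:
        # The wanted code point was the last one in the buffer.
        return buf[start:].decode("utf-8")
    raise IndexError("string index out of range")
```

For a buffer of a few dozen bytes the walk is trivial, which is why I expect the problem to stay theoretical on small targets.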
... those strings will need to be transcoded to UTF-8 before they can be written or printed, so keeping them as UTF-8 ...
That all assumes that the external world is using UTF-8 anyhow.
Which is more likely to be true if you document it as a limitation of MicroPython.
... but many strings may never be written out:
print(prefix + s[1:].strip().lower().center(80) + suffix)
creates five strings that are never written out and one that is.
But looking at the actual strings -- UTF-8 doesn't really hurt much. Only the slice and center() are more complex, and for a string less than 80 characters long, O(N) is irrelevant.
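Spelled out step by step, with made-up values for prefix, suffix and s (they are assumptions for the example, not taken from any real code), the temporaries look like this:

```python
# Example values, chosen only for illustration.
prefix, suffix, s = "[", "]", "x  Hello World  "

t1 = s[1:]          # slice: must locate where character 1 starts
t2 = t1.strip()     # scans inward from both ends
t3 = t2.lower()     # one sequential pass
t4 = t3.center(80)  # must know the length in characters
t5 = prefix + t4    # plain concatenation
out = t5 + suffix   # the only string that is actually written out

print(out)
```

Of the six strings created, t1 through t5 are discarded as soon as the next step has consumed them.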
If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ