On 5 June 2014 17:54, Stephen J. Turnbull email@example.com wrote:
What matters to you is that str (unicode) is an opaque type -- there is no specification of the internal representation in the language reference, and in fact several different ones coexist happily across existing Python implementations -- and you're free to use a UTF-8 implementation if that suits the applications you expect for MicroPython.
However, as others have noted in the thread, the critical thing is to *not* let that internal implementation detail leak into the Python-level string behaviour. That's what happened with narrow builds of Python 2 and pre-PEP-393 releases of Python 3 (effectively using UTF-16 internally), and it caused enough bugs that the Linux distributions tend instead to accept the memory cost of using wide builds (4 bytes for all code points) for affected versions.
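To make the "leak" concrete, here is a small illustration that runs on a current CPython 3. It does not reproduce a narrow build; it only simulates the UTF-16 view by encoding the string. The helper names `code_point_view` and `utf16_code_unit_view` are made up for this example.

```python
def code_point_view(s):
    """Length and items as the Python 3 str semantics define them."""
    return len(s), [s[i] for i in range(len(s))]


def utf16_code_unit_view(s):
    """Length and items that a leaked UTF-16 representation would expose.

    Simulated by encoding to UTF-16 and splitting into 16-bit code units.
    """
    data = s.encode("utf-16-le")
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    return len(units), units


# One non-BMP code point between two ASCII letters.
s = "a\U0001F600b"
print(code_point_view(s))       # three items, one per code point
print(utf16_code_unit_view(s))  # four items, the middle two are surrogates
```

With code point semantics `s[1]` is the whole character; in the code unit view the same index yields only a lone high surrogate, which is the kind of behaviour that caused bugs on narrow builds.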
Preserving the "Python 3 str type is an immutable array of code points" semantics matters significantly more than whether or not indexing by code point is O(1). The various caching tricks suggested in this thread (especially "leading ASCII characters", "trailing ASCII characters" and "position & index of last lookup") could keep the typical lookup performance well below O(N).
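As a rough sketch of two of those tricks (the leading ASCII prefix and the last lookup cache), assuming a UTF-8 buffer as the storage: the class name `Utf8Str` and its attributes are hypothetical, this is not how MicroPython or any other implementation actually does it, and slicing and lone surrogates are left out.

```python
class Utf8Str:
    """Hypothetical sketch: UTF-8 storage, indexing by code point."""

    def __init__(self, text):
        self._buf = text.encode("utf-8")
        self._length = len(text)  # number of code points, not bytes
        # Within the leading run of ASCII bytes, index == byte offset.
        n = 0
        while n < len(self._buf) and self._buf[n] < 0x80:
            n += 1
        self._ascii_prefix = n
        # (code point index, byte offset) of the most recent lookup.
        self._last = (0, 0)

    def __len__(self):
        return self._length

    def _offset(self, index):
        """Return the byte offset at which code point `index` starts."""
        if index < self._ascii_prefix:
            return index  # O(1) inside the ASCII prefix
        cp, off = self._last
        if index < cp or cp < self._ascii_prefix:
            # Cache is no use here; restart from the end of the prefix.
            cp = off = self._ascii_prefix
        while cp < index:
            off += 1
            # Skip continuation bytes (0b10xxxxxx) to the next lead byte.
            while off < len(self._buf) and (self._buf[off] & 0xC0) == 0x80:
                off += 1
            cp += 1
        self._last = (cp, off)
        return off

    def __getitem__(self, index):
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("string index out of range")
        start = self._offset(index)
        end = start + 1
        while end < len(self._buf) and (self._buf[end] & 0xC0) == 0x80:
            end += 1
        return self._buf[start:end].decode("utf-8")
```

The worst case is still O(N), but a forward loop over the indices only scans the bytes between the previous lookup and the current one, and pure ASCII strings never scan at all. The caller only ever sees code points, so the storage format stays an implementation detail.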
PEP 393 exists, of course, and specifies the current internal representation for CPython 3. But I don't see anything in it that suggests it's mandated for any other implementation.
CPython is constrained by C API compatibility requirements, as well as implementation constraints due to the amount of internal code that would need to be rewritten to handle a variable width encoding as the canonical internal representation (since the problems with Python 2 narrow builds mean we already know variable width encodings aren't handled correctly by the current code).
Implementations that share code with CPython, or try to mimic the C API especially closely, may face similar restrictions. Outside that, I think we're better off if alternative implementations are free to experiment with different internal string representations.