
On 5 June 2014 22:01, Paul Sokolovsky pmiscml@gmail.com wrote:
> All these changes are what let me dream on and speculate on possibility that Python4 could offer an encoding-neutral string type (which means based on bytes)
To me, an "encoding neutral string type" means roughly "characters are atomic", and the best representation we have for a "character" is a Unicode code point. Through any interface that provides "characters", each individual "character" (code point) is indivisible.
To me, Python 3 has exactly an "encoding-neutral string type". It also has a bytes type that is just that - bytes which can represent anything at all. It might be the UTF-8 representation of a string, but you have the freedom to manipulate it however you like - including making it no longer valid UTF-8.
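As a quick illustration of that distinction, here is a minimal sketch in CPython 3. The sample string and the variable names are arbitrary choices for the example, nothing more.

```python
# str elements are code points; bytes elements are just bytes.
s = "na\u00efve"                 # "naïve": 5 code points
b = s.encode("utf-8")            # 6 bytes, because U+00EF takes two bytes in UTF-8

assert len(s) == 5
assert len(b) == 6
assert s[2] == "\u00ef"          # indexing a str yields a whole code point

# bytes can be cut anywhere, including through the middle of a character.
broken = b[:3]                   # ends with the lead byte of U+00EF and nothing after it

try:
    broken.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False          # the slice is no longer valid UTF-8
```

Nothing in the str interface lets you split U+00EF in two; only the bytes object allows that.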
Whilst I think O(1) indexing of strings is important, I don't think it's as important as the property that "characters" are indivisible. I would be quite happy for MicroPython to use UTF-8 as the underlying string representation (or something cleverer; there are several ideas in this thread) so long as:
1. It maintains a string type that presents code points as indivisible elements;
2. The performance consequences of using UTF-8 are documented, as well as any optimisations, tricks, etc. that are used to overcome those consequences (and what impact, if any, they would have if code written for MicroPython were run in CPython).
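To make the two conditions concrete, here is a rough sketch of a string type that stores UTF-8 internally but only ever hands out whole code points. The class `Utf8Str` and its helper are hypothetical names invented for this example; this is not MicroPython's implementation, just one naive way it could look. Note that both `len()` and indexing are O(n) here, which is exactly the kind of consequence that would need documenting.

```python
class Utf8Str:
    """Hypothetical sketch: UTF-8 storage, code-point interface."""

    def __init__(self, text):
        self._data = text.encode("utf-8")

    @staticmethod
    def _is_start(byte):
        # UTF-8 continuation bytes look like 0b10xxxxxx;
        # any other byte begins a new code point.
        return (byte & 0xC0) != 0x80

    def __len__(self):
        # O(n): count the bytes that begin a code point.
        return sum(1 for byte in self._data if self._is_start(byte))

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("this sketch supports integer indices only")
        if index < 0:
            index += len(self)
        if index < 0:
            raise IndexError("string index out of range")
        seen = -1        # index of the most recent code point encountered
        start = None     # byte offset where the requested code point begins
        for pos, byte in enumerate(self._data):
            if self._is_start(byte):
                seen += 1
                if seen == index:
                    start = pos
                elif seen == index + 1:
                    # Reached the next code point: return the complete one.
                    return self._data[start:pos].decode("utf-8")
        if start is None:
            raise IndexError("string index out of range")
        return self._data[start:].decode("utf-8")


u = Utf8Str("a\u20acb\U0001F600")   # 4 code points stored in 9 bytes
assert len(u) == 4
assert u[1] == "\u20ac"             # the whole euro sign, never a fragment
```

An optimisation such as caching the byte offset of the last index looked up would speed up sequential access, but the results visible to the caller should stay identical to CPython's.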
Cheers,
Tim Delaney