
On Wed, Jun 04, 2014 at 01:14:04PM +0000, Steve Dower wrote:
> I agree with Daniel. Directly indexing into text suggests an attempted optimization that is likely to be incorrect for a set of strings.
I'm afraid I don't understand this argument. The language semantics say that a string is an array of code points. Every index refers to a single code point, and no code point extends over two or more indexes. There's a 1:1 relationship between code points and indexes. How is direct indexing "likely to be incorrect"?
e.g.
    s = "---ÿ---"
    offset = s.index('ÿ')
    assert s[offset] == 'ÿ'
That cannot fail with Python's semantics.
[Aside: it does fail in Python 2, showing that the idea that "strings are bytes" is fatally broken. Fortunately Python has moved beyond that.]
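To make the aside concrete, here is a minimal sketch in Python 3 that contrasts the two models. It assumes the Python 2 byte string held the UTF-8 encoding of the text (which depends on the source encoding), and the names `b`, `byte_offset` and `first_byte` are mine, purely for illustration:

```python
s = "---ÿ---"
offset = s.index("ÿ")
assert s[offset] == "ÿ"  # code-point indexing: the round trip always holds

# Assumption: emulate a Python 2 byte string by encoding to UTF-8.
b = s.encode("utf-8")
byte_offset = b.index("ÿ".encode("utf-8"))

# One index now yields one byte (0xC3), not the two-byte sequence for 'ÿ',
# so the equivalent of the assertion above would fail.
first_byte = b[byte_offset:byte_offset + 1]
assert first_byte != "ÿ".encode("utf-8")
```

With bytes, the offset found by searching and the unit returned by indexing no longer describe the same thing, which is exactly the breakage the aside refers to.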
> Splitting, regex, concatenation and formatting are really the main operations that matter, and MicroPython can optimize their implementation of these easily enough for O(N) indexing.
Really? Well, it will be a nice experiment. Fortunately MicroPython runs under Linux as well as on embedded systems (a clever decision, by the way) so I look forward to seeing how their internal-utf8 implementation stacks up against CPython's FSR implementation.
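For anyone wondering what O(N) indexing means in practice, here is a rough sketch of looking up the n-th code point in a UTF-8 buffer by linear scan. This is my own illustration, not MicroPython's actual code; the function name `utf8_index` is hypothetical, and it assumes the buffer is valid UTF-8:

```python
def utf8_index(buf: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of UTF-8 data by linear scan."""
    if n < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    count = -1    # index of the code point most recently started
    start = None  # byte offset where code point n begins
    for i, byte in enumerate(buf):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if (byte & 0xC0) != 0x80:
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return buf[start:i].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    return buf[start:].decode("utf-8")  # code point n was the last one
```

The cost grows with the position of the index, whereas a fixed-width representation such as the FSR can compute the offset directly. Whether that difference matters for real workloads is what the experiment would show.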
Out of curiosity, when the FSR was proposed, did anyone consider an internal UTF-8 representation? If so, why was it rejected?