[Python-Dev] Internal representation of strings and Micropython

Wed Jun 4 17:32:25 CEST 2014

Steven D'Aprano wrote:
> The language semantics says that a string is an array of code points. Every
> index relates to a single code point, no code point extends over two or more
> indexes.
> There's a 1:1 relationship between code points and indexes. How is direct
> indexing "likely to be incorrect"?

We're discussing the behaviour under a different (hypothetical) design decision than a 1:1 relationship between code points and indexes, so arguing from that stance doesn't make much sense.

> e.g.
> 
> s = "---ÿ---"
> offset = s.index('ÿ')
> assert s[offset] == 'ÿ'
> 
> That cannot fail with Python's semantics.

Agreed, and it shouldn't (I was actually referring to the optimization being incorrect for the goal, not the language semantics). What you'd probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is also correct.

But what are you trying to achieve (why are you writing this code)? All this example really shows is that you're only using indexing for trivial purposes.

Chris's example of an actual case where it may look like a good idea to use indexing for optimization makes this more obvious IMHO:

Chris Angelico wrote:
> Suppose you have a long title, and you need to abbreviate it by dropping out
> words (delimited by whitespace), such that you keep the first word (always) and
> the last (if possible) and as many as possible in between. How are you going to
> write that? With PEP 393 or UTF-32 strings, you can simply record the index of
> every whitespace you find, count off lengths, and decide what to keep and what
> to ellipsize.

"Recording the index" is where the optimization comes in. With a variable-length encoding - heck, even with a fixed-length one - I'd just use str.split(' ') (or re.split('\\s', string), depending on how much I care about the type of delimiter) and manipulate the list.

If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.

The downside is that it isn't as easy to teach as the 1:1 relationship, and currently it doesn't perform as well *in CPython*. But if MicroPython is focusing on size over speed, I don't see any reason why they shouldn't permit different performance characteristics and require a slightly different approach to highly-optimized coding.

In any case, this is an interesting discussion with a genuine effect on the Python interpreter ecosystem. Jython and IronPython already have different string implementations from CPython - having official (and hopefully flexible) guidance on deviations from the reference implementation would I think help other implementations provide even more value, which is only a good thing for Python.

Cheers,
Steve