Re: [Python-Dev] Internal representation of strings and Micropython

4 Jun 2014

Steven D'Aprano wrote:
> ...
> The language semantics says that a string is an array of code points. Every
> index relates to a single code point, no code point extends over two or more
> indexes.
> There's a 1:1 relationship between code points and indexes. How is direct
> indexing "likely to be incorrect"?

We're discussing behaviour under a different (hypothetical) design decision, one without a 1:1 relationship between code points and indexes, so arguing from that stance doesn't make much sense.

> ...
> e.g.
> s = "---ÿ---"
> offset = s.index('ÿ')
> assert s[offset] == 'ÿ'
> That cannot fail with Python's semantics.

Agreed, and it shouldn't (I was actually referring to the optimization being incorrect for the goal, not the language semantics). What you'd probably find is that sizeof('ÿ') == sizeof(s[offset]) == 2, which may be surprising, but is also correct.
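
As a rough illustration of the size point, assuming the hypothetical implementation stores strings as UTF-8: sizeof() above is not a real Python function, so the length of the encoded bytes stands in for it here, and the names char_size and byte_offset are made up for the sketch.

```python
s = "---\u00ff---"  # the same string as "---ÿ---"
offset = s.index("\u00ff")

# Code point semantics: one index, one code point.
assert s[offset] == "\u00ff"
assert len(s) == 7

# Assumed UTF-8 storage: U+00FF needs two bytes, so the stored form is
# one byte longer than the string's length in code points.
encoded = s.encode("utf-8")
char_size = len(s[offset].encode("utf-8"))     # bytes used by the character
byte_offset = len(s[:offset].encode("utf-8"))  # where it starts in the bytes
```

The code point index and the byte offset happen to agree here (both 3) only because everything before the character is ASCII.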

But what are you trying to achieve (why are you writing this code)? All this example really shows is that you're only using indexing for trivial purposes.

Chris's example of an actual case where it may look like a good idea to use indexing for optimization makes this more obvious IMHO:

Chris Angelico wrote:
> ...
> Suppose you have a long title, and you need to abbreviate it by dropping out
> words (delimited by whitespace), such that you keep the first word (always) and
> the last (if possible) and as many as possible in between. How are you going to
> write that? With PEP 393 or UTF-32 strings, you can simply record the index of
> every whitespace you find, count off lengths, and decide what to keep and what
> to ellipsize.

"Recording the index" is where the optimization comes in. With a variable-length encoding - heck, even with a fixed-length one - I'd just use str.split(' ') (or re.split('\\s', string), depending on how much I care about the type of delimiter) and manipulate the list.

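A minimal sketch of the split-and-manipulate approach, under one reading of the requirements: the function name abbreviate, the max_len parameter and the "..." marker are assumptions for illustration, and it uses str.split() with no argument so that any run of whitespace counts as a delimiter.

```python
def abbreviate(title, max_len, ellipsis="..."):
    """Drop words from the middle of title so the result fits max_len.

    The first word is always kept (even if it alone is too long), the
    last word is kept if it fits, and leading middle words are kept
    while there is room.
    """
    words = title.split()
    full = " ".join(words)
    if len(full) <= max_len or len(words) < 2:
        return full

    first, last = words[0], words[-1]
    # Keep the last word only if "first ... last" fits.
    tail = [last] if len(" ".join([first, ellipsis, last])) <= max_len else []

    kept = [first]
    for word in words[1:-1]:
        candidate = kept + [word, ellipsis] + tail
        if len(" ".join(candidate)) > max_len:
            break
        kept.append(word)
    return " ".join(kept + [ellipsis] + tail)
```

Nothing here records or uses a string index; all lengths are measured on the words themselves, so the code does not care how the implementation stores the string.
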
If copying into a separate list is a problem (memory-wise), re.finditer('\\S+', string) also provides the same behaviour and gives me the sliced string, so there's no need to index for anything.
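
For the re.finditer variant, a small sketch (the helper name first_and_last_word is hypothetical) that walks the words one match at a time without building a list and without indexing into the string:

```python
import re

def first_and_last_word(string):
    """Return (first, last) whitespace-delimited words, or (None, None)."""
    first = last = None
    for match in re.finditer(r"\S+", string):
        word = match.group()  # the matched word, already sliced out
        if first is None:
            first = word
        last = word
    return first, last
```

Each match object still offers start() and end() if positions are ever needed, but this loop never asks for them.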

The downside is that it isn't as easy to teach as the 1:1 relationship, and currently it doesn't perform as well *in CPython*. But if MicroPython is focusing on size over speed, I don't see any reason why they shouldn't permit different performance characteristics and require a slightly different approach to highly-optimized coding.

In any case, this is an interesting discussion with a genuine effect on the Python interpreter ecosystem. Jython and IronPython already have different string implementations from CPython. Having official (and hopefully flexible) guidance on deviations from the reference implementation would, I think, help other implementations provide even more value, which is only a good thing for Python.

Cheers,
Steve

Steve Dower