Re: [Python-Dev] PEP 393 Summer of Code Project

26 Aug 2011


      Guido van Rossum, 26.08.2011 19:02:
...
On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel wrote:
...
Besides, what if these implementations provided indexing in, say, O(log N)
instead of O(1) or O(N), e.g. by building a tree index into each string? You
could have an index that simply marks runs of surrogate pairs and BMP
substrings, thus providing a likely-to-be-somewhat-compact index. That index
would obviously have to be built, but so do the different string
representations in post-PEP-393 CPython, especially on Windows, as I have
learned.
Eek. No, please.
I was mostly just confabulating. My main point was that this isn't a 
black-and-white thing - O(1) xor O(N) - and thus is orthogonal to the PEP. 
You can achieve compliant/acceptable behaviour at the code point level, the 
performance guarantees level or the platform integration level - choose any 
two. CPython is just lucky that there isn't really a platform integration 
level to take into account (if we leave the Windows environment aside for a 
moment).
...
Those platforms' native string types have length and
slicing operations that are O(1) and work in terms of 16-bit code
points. Python should use those. It would be awful if Java and Python
code doing the same manipulations on the same string would come to
different conclusions because Python tried to paper over surrogates.
I fully agree.
...
...
Would such a less severe violation of the strict O(1) rule still be "not
ok"? I think this is not such a clear black-and-white issue. Both
implementations have notably different performance characteristics than
CPython in some more or less important areas, as does PyPy. At some point,
the language compliance label has to account for that.
Since you had to ask, I have to declare that, indeed, non-O(1)
behavior would not be okay for those platforms.
I take it that you say that because you want strings to perform in the 
'normal' platform specific way here (i.e. like Java/.NET strings), and not 
so much because you want to require the exact same (performance) 
characteristics across Python implementations. So your choice is platform 
integration over code points, leaving the identical performance as a 
side-effect of the platform integration.
...
All in all, I don't think we should legislate Python strings to be
able to support 21-bit code points using O(1) indexing. PEP 393 makes
this possible for CPython, and it's been said that PyPy can follow
suit. But it'll be a "quality-of-implementation" issue, not built into
the language spec.
Makes sense to me. Most likely, Unicode heavy Python code will have to take 
platform specifics into account anyway, so there are limits as to what is 
suitable for a language spec.

Stefan