[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 20:14:17 CEST 2011

Guido van Rossum, 26.08.2011 19:02:
> On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel wrote:
>> Besides, what if these implementations provided indexing in, say, O(log N)
>> instead of O(1) or O(N), e.g. by building a tree index into each string? You
>> could have an index that simply marks runs of surrogate pairs and BMP
>> substrings, thus providing a likely-to-be-somewhat-compact index. That index
>> would obviously have to be built, but so do the different string
>> representations in post-PEP-393 CPython, especially on Windows, as I have
>> learned.
>
> Eek. No, please.

I was mostly just confabulating. My main point was that this isn't a 
black-and-white thing - O(1) xor O(N) - and thus is orthogonal to the PEP. 
You can achieve compliant/acceptable behaviour at the code point level, the 
performance guarantees level or the platform integration level - choose any 
two. CPython is just lucky that there isn't really a platform integration 
level to take into account (if we leave the Windows environment aside for a 
moment).

> Those platforms' native string types have length and
> slicing operations that are O(1) and work in terms of 16-bit code
> points. Python should use those. It would be awful if Java and Python
> code doing the same manipulations on the same string would come to
> different conclusions because Python tried to paper over surrogates.

I fully agree.

>> Would such a less severe violation of the strict O(1) rule still be "not
>> ok"? I think this is not such a clear black-and-white issue. Both
>> implementations have notably different performance characteristics than
>> CPython in some more or less important areas, as does PyPy. At some point,
>> the language compliance label has to account for that.
>
> Since you had to ask, I have to declare that, indeed, non-O(1)
> behavior would not be okay for those platforms.

I take it that you say that because you want strings to perform in the 
'normal' platform specific way here (i.e. like Java/.NET strings), and not 
so much because you want to require the exact same (performance) 
characteristics across Python implementations. So your choice is platform 
integration over code points, leaving the identical performance as a 
side-effect of the platform integration.

> All in all, I don't think we should legislate Python strings to be
> able to support 21-bit code points using O(1) indexing. PEP 393 makes
> this possible for CPython, and it's been said that PyPy can follow
> suit. But it'll be a "quality-of-implementation" issue, not built into
> the language spec.

Makes sense to me. Most likely, Unicode heavy Python code will have to take 
platform specifics into account anyway, so there are limits as to what is 
suitable for a language spec.

Stefan