Guido van Rossum, 26.08.2011 19:02:
On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel wrote:
Besides, what if these implementations provided indexing in, say, O(log N) instead of O(1) or O(N), e.g. by building a tree index into each string? You could have an index that simply marks runs of surrogate pairs and BMP substrings, thus providing a likely-to-be-somewhat-compact index. That index would obviously have to be built, but so do the different string representations in post-PEP-393 CPython, especially on Windows, as I have learned.
Eek. No, please.
I was mostly just confabulating. My main point was that this isn't a black-and-white thing - O(1) xor O(N) - and thus is orthogonal to the PEP. You can achieve compliant/acceptable behaviour at the code point level, the performance guarantees level or the platform integration level - choose any two. CPython is just lucky that there isn't really a platform integration level to take into account (if we leave the Windows environment aside for a moment).
Those platforms' native string types have length and slicing operations that are O(1) and work in terms of 16-bit code points. Python should use those. It would be awful if Java and Python code doing the same manipulations on the same string would come to different conclusions because Python tried to paper over surrogates.
I fully agree.
Would such a less severe violation of the strict O(1) rule still be "not ok"? I think this is not such a clear black-and-white issue. Both implementations have notably different performance characteristics than CPython in some more or less important areas, as does PyPy. At some point, the language compliance label has to account for that.
Since you had to ask, I have to declare that, indeed, non-O(1) behavior would not be okay for those platforms.
I take it that you say that because you want strings to perform in the 'normal' platform specific way here (i.e. like Java/.NET strings), and not so much because you want to require the exact same (performance) characteristics across Python implementations. So your choice is platform integration over code points, leaving the identical performance as a side-effect of the platform integration.
All in all, I don't think we should legislate Python strings to be able to support 21-bit code points using O(1) indexing. PEP 393 makes this possible for CPython, and it's been said that PyPy can follow suit. But it'll be a "quality-of-implementation" issue, not built into the language spec.
Makes sense to me. Most likely, Unicode heavy Python code will have to take platform specifics into account anyway, so there are limits as to what is suitable for a language spec. Stefan