[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 19:02:46 CEST 2011

On Fri, Aug 26, 2011 at 3:29 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> "Martin v. Löwis", 26.08.2011 11:29:
>>
>> You seem to assume it is ok for Jython/IronPython to provide indexing in
>> O(n). It is not.
>
> I think we can leave this discussion aside.

(And yet, you keep arguing. :-)

> Jython and IronPython have their
> own platform specific constraints to which they need to adapt their
> implementation. For a Jython user, it means a lot to be able to efficiently
> pass strings (and other data) back and forth between Jython and other JVM
> code, and it's not hard to guess that the same is true for IronPython/.NET
> users. After all, the platform integration is the very *reason* for most
> users to select one of these implementations.

Right.

> Besides, what if these implementations provided indexing in, say, O(log N)
> instead of O(1) or O(N), e.g. by building a tree index into each string? You
> could have an index that simply marks runs of surrogate pairs and BMP
> substrings, thus providing a likely-to-be-somewhat-compact index. That index
> would obviously have to be built, but so do the different string
> representations in post-PEP-393 CPython, especially on Windows, as I have
> learned.

Eek. No, please. Those platforms' native string types have length and
slicing operations that are O(1) and work in terms of 16-bit code
points. Python should use those. It would be awful if Java and Python
code doing the same manipulations on the same string would come to
different conclusions because Python tried to paper over surrogates.

I dug up some evidence for Java, at least:

http://download.oracle.com/javase/1,5.0/docs/api/java/lang/CharSequence.html#length%28%29

"""
length

int length()

    Returns the length of this character sequence. The length is the
number of 16-bit chars in the sequence.

    Returns:
        the number of chars in this sequence
"""

This is quite explicit about counting 16-bit code units. I've found
similar info about .NET, which defines "char" as a 16-bit quantity and
string length in terms of the number of "char" items.

> Would such a less severe violation of the strict O(1) rule still be "not
> ok"? I think this is not such a clear black-and-white issue. Both
> implementations have notably different performance characteristics than
> CPython in some more or less important areas, as does PyPy. At some point,
> the language compliance label has to account for that.

Since you had to ask, I have to declare that, indeed, non-O(1)
behavior would not be okay for those platforms.

All in all, I don't think we should legislate Python strings to be
able to support 21-bit code points using O(1) indexing. PEP 393 makes
this possible for CPython, and it's been said that PyPy can follow
suit. But it'll be a "quality-of-implementation" issue, not built into
the language spec.

-- 
--Guido van Rossum (python.org/~guido)