On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. Löwis" wrote:
IronPython and Jython can retain UTF-16 as their native form if that makes interop cleaner, but in doing so they need to ensure that basic operations like indexing and len work in terms of code points, not code units, if they are to conform.
That means that they won't conform, period. There is no efficient maintainable implementation strategy to achieve that property, and it may well take years until somebody provides an efficient unmaintainable implementation.
Does this make sense, or have I completely misunderstood things?
You seem to assume it is ok for Jython/IronPython to provide indexing in O(n). It is not.
Indeed.
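To illustrate the O(n) cost discussed above, here is a minimal sketch of code-point indexing over a sequence of UTF-16 code units. The helpers `utf16_units` and `code_point_at` are hypothetical names made up for this example; they are not taken from Jython, IronPython, or PEP 393, and a real implementation would work on its native string storage rather than a Python list.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_point_at(units, index):
    """Return the code point at code-point position `index` (hypothetical helper).

    The scan starts at the first code unit every time, so the cost grows
    linearly with the position being looked up.
    """
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    i = 0
    while i < len(units):
        unit = units[i]
        # A high surrogate followed by a low surrogate encodes one code point.
        is_pair = (0xD800 <= unit <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if index == 0:
            if is_pair:
                return 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            return unit
        index -= 1
        i += 2 if is_pair else 1
    raise IndexError("code point index out of range")


units = utf16_units("a\U0001F600b")
print(hex(code_point_at(units, 1)))  # 0x1f600
```

Because a surrogate pair can appear anywhere, the position of the n-th code point cannot be computed from n alone without scanning or keeping an auxiliary index.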
However, non-conformance may not be that much of an issue. They do not conform in many other aspects either (such as not supporting Python 3, for example, or not supporting the C API), so they may well choose to ignore such a minor requirement if there were one. For BMP strings, they conform fine, and it may well be that Jython users either don't have non-BMP strings, or don't care whether len() or indexing of their non-BMP strings is "correct".
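As a small illustration of the BMP point, the sketch below compares the code-point length with the UTF-16 code-unit length. It assumes CPython 3.3 or later, where len() counts code points; `utf16_len` is a hypothetical helper standing in for what a 16-bit-based string type would report.

```python
def utf16_len(s):
    """Length of s in UTF-16 code units (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


bmp = "L\u00f6wis"          # every character is in the BMP
non_bmp = "a\U0001F600b"    # one character lies outside the BMP

print(len(bmp), utf16_len(bmp))          # 5 5
print(len(non_bmp), utf16_len(non_bmp))  # 3 4
```

The two lengths agree for the BMP-only string and differ by one for each non-BMP character.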
I think this is fine. I had been hoping that all Python implementations claiming compatibility with version 3.3 of the language reference would be free of worries about surrogates, but it simply doesn't make sense. And yes, I'm well aware that PEP 393 is only for CPython. It's just that I had hoped that it would get rid of some of Tom C's specific complaints for all Python implementations; but it really seems impossible to do so. One consequence may be that the standard library, to the extent it is shared by other implementations, may still have to worry about surrogates and other issues inherent in narrow builds or other 16-bit-based string types. We'll cross that bridge when we get to it.

-- 
--Guido van Rossum (python.org/~guido)