Re: [Python-Dev] PEP 393 Summer of Code Project

26 Aug 2011

On Fri, Aug 26, 2011 at 2:29 AM, "Martin v. Löwis" wrote:
...
...
IronPython and Jython can retain UTF-16 as their native form if that
makes interop cleaner, but in doing so they need to ensure that basic
operations like indexing and len work in terms of code points, not
code units, if they are to conform.
That means that they won't conform, period. There is no efficient
maintainable implementation strategy to achieve that property, and
it may well take years until somebody provides an efficient
unmaintainable implementation.
...
Does this make sense, or have I completely misunderstood things?
You seem to assume it is ok for Jython/IronPython to provide indexing in
O(n). It is not.
Indeed.
...
However, non-conformance may not be that much of an issue. They fail to
conform in so many other aspects (such as not supporting Python 3,
for example, or not supporting the C API) that they may well choose to
ignore such a minor requirement if there was one. For BMP strings,
they conform fine, and it may well be that Jython users either don't
have non-BMP strings, or don't care whether len() or indexing of their
non-BMP strings is "correct".
I think this is fine. I had been hoping that all Python
implementations claiming compatibility with version 3.3 of the
language reference would be free of worries about surrogates, but it
simply doesn't make sense.

And yes, I'm well aware that PEP 393 is only for CPython. It's just
that I had hoped that it would get rid of some of Tom C's specific
complaints for all Python implementations; but it really seems
impossible to do so.

One consequence may be that the standard library, to the extent it is
shared by other implementations, may still have to worry about
surrogates and other issues inherent in narrow builds or other
16-bit-based string types. We'll cross that bridge when we get to it.
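
To make the surrogate issue concrete, here is a rough sketch (an
illustration only, not code from PEP 393 or from any implementation)
of why a UTF-16-backed string has trouble giving code-point semantics
for len() and indexing. The helper names utf16_units and code_point_at
are made up for this example, and it assumes a Python where str is
indexed by code point (a wide build, or 3.3 with PEP 393).

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def code_point_at(units, index):
    """Return the code point at code-point position `index`.

    Without a side index, the only option is to scan from the start and
    step over surrogate pairs, so the cost is O(n) in the string length.
    """
    pos = 0  # current position counted in code points
    i = 0    # current position counted in code units
    while i < len(units):
        u = units[i]
        is_pair = (0xD800 <= u <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            # Combine high and low surrogate into one non-BMP code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            cp = u
            step = 1
        if pos == index:
            return cp
        pos += 1
        i += step
    raise IndexError("code point index out of range")


text = "a\U0001F600b"      # one non-BMP character between two ASCII letters
units = utf16_units(text)  # what a 16-bit-based string type would store
```

Here len(units) counts the non-BMP character twice while len(text)
counts it once, and code_point_at has to walk the units to find the
right position. That linear scan is the O(n) indexing discussed above.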

-- 
--Guido van Rossum (python.org/~guido)