[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 10:29:27 CEST 2011

On 26 August 2011 03:52, Guido van Rossum <guido at python.org> wrote:
> I know that by now I am repeating myself, but I think it would be
> really good if we could get rid of this ambiguity. PEP 393 seems the
> best way forward, even if it doesn't directly address what to do for
> IronPython or Jython, both of which have to deal with a pervasive
> native string type that contains UTF-16.

Hmm, I'm completely naive in this area, but from reading the thread,
would a possible approach be to say that Python (the language
definition) is defined in terms of code points (as we already do, even
if the wording might benefit from some clarification)? Then, under PEP
393, and currently in wide builds, CPython conforms to that definition
(and retains the property of basic operations being O(1), which is not
in the language definition but is a user expectation and your
expressed requirement).
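
To make the code point / code unit distinction concrete, here is a small
sketch. It assumes an interpreter whose str behaves as PEP 393 describes
(or a current wide build); the variable names are just for illustration.

```python
# A three-code-point string whose middle character, U+10400, lies outside
# the BMP and therefore needs a surrogate pair in UTF-16.
s = "a\U00010400b"

# Counted in code points, which is what the language definition would promise.
code_points = len(s)

# Counted in UTF-16 code units: each unit is 2 bytes, and the non-BMP
# character takes two of them.
utf16_code_units = len(s.encode("utf-16-le")) // 2

# Indexing by code point returns the whole character, not half a pair.
middle = s[1]

print(code_points, utf16_code_units, hex(ord(middle)))
```

On a narrow build the first count and the indexing result would differ,
which is the ambiguity under discussion.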

IronPython and Jython can retain UTF-16 as their native form if that
makes interop cleaner, but in doing so they need to ensure that basic
operations like indexing and len work in terms of code points, not
code units, if they are to conform. Presumably this will be easier
than moving to a UCS-4 representation, as they can defer to runtime
support routines via interop (which presumably get this right - or at
the very least can be blamed for any errors :-)). They lose the O(1)
guarantee, but that's easily defensible as a tradeoff to conform to
underlying runtime semantics.
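
As a rough illustration of that tradeoff, the sketch below keeps the data
as UTF-16 code units but answers len and indexing in code points. All of
the helper names are made up for this example; a real implementation would
presumably call into the runtime's own routines rather than scan by hand,
and this version ignores negative indices.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def iter_code_points(units):
    """Yield code points from UTF-16 code units, joining surrogate pairs."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        is_pair = (
            0xD800 <= u <= 0xDBFF
            and i + 1 < n
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character, or a lone surrogate passed through unchanged.
            yield u
            i += 1


def cp_len(units):
    """Length in code points: O(n), since the whole sequence is scanned."""
    return sum(1 for _ in iter_code_points(units))


def cp_index(units, index):
    """Code point at a non-negative index: O(n) in the worst case."""
    for pos, cp in enumerate(iter_code_points(units)):
        if pos == index:
            return chr(cp)
    raise IndexError("code point index out of range")


units = utf16_units("a\U0001F600b")
print(len(units), cp_len(units), cp_index(units, 2))
```

Both operations walk the sequence from the start, which is where the O(1)
guarantee goes.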

Does this make sense, or have I completely misunderstood things?

Paul.

PS Thanks to all for the discussion in general, I'm learning a lot
about Unicode from all of this!