[Python-3000] How will unicode get used?

Thu Sep 21 00:41:15 CEST 2006

Guido writes:
> > As far as I can tell, CPython on windows uses UTF-16 with code units.
> > Perhaps not intentionally, but by default (not throwing an error on
> > surrogates).
>
> This is intentional, to be compatible with the rest of that platform.
> Jython and IronPython do this too I believe.

The following code illustrates this:

>>> msg = u'The ancient greeks used the letter "\U00010143" for the number 5.'
>>> msg[35:-18]
u'"\U00010143"'
>>> greek_five = msg[36:-19]
>>> len(greek_five)
2
>>> greek_five[0]
u'\ud800'
>>> greek_five[1]
u'\udd43'

The single Unicode character greek_five, when expressed as a string
in CPython (at least in a UTF-16 build such as the Windows one), has
a length of 2 and can be sliced into two separate "characters", the
two halves of a surrogate pair. In Jython, the code above will not
work because Jython doesn't currently support \U or extended Unicode
(but someday that may change). I'm not sure about IronPython.
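
For anyone wondering where u'\ud800' and u'\udd43' come from, here
is a minimal sketch of the standard UTF-16 surrogate arithmetic. The
function name to_surrogate_pair is my own invention for illustration,
not anything in CPython.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration; not part of any Python API.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10143)
print(hex(high), hex(low))  # 0xd800 0xdd43
```

Those are the same two values the slicing session above exposes as
greek_five[0] and greek_five[1].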

So if I understand Guido's point, he's saying that it is on purpose
that len(greek_five) == 2. That's useful for compatibility today
with the Java and Microsoft VM platforms. But it's not particularly
compatible with extended Unicode. (Technically it doesn't violate
any rules so long as it's clearly defined that a character in Python
is NOT the same as a Unicode code point.)
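
To make the code unit versus code point distinction concrete, here is
a rough sketch that counts code points in a sequence of UTF-16 code
units given as plain integers. The name count_code_points is
hypothetical, and the sketch assumes well-formed input (every low
surrogate follows a high surrogate); it does no validation.

```python
def count_code_points(code_units):
    """Count code points in a sequence of UTF-16 code units (integers).

    Hypothetical helper; assumes the surrogates are correctly paired.
    """
    count = 0
    for unit in code_units:
        # A low surrogate (0xDC00-0xDFFF) only completes the pair begun
        # by the preceding high surrogate, so it adds no new code point.
        if not 0xDC00 <= unit <= 0xDFFF:
            count += 1
    return count


greek_five_units = [0xD800, 0xDD43]         # the two units shown above
print(len(greek_five_units))                # 2 code units
print(count_code_points(greek_five_units))  # 1 code point
```

The first number is what len() reports when a string is a sequence of
code units; the second is what it would report if a string were a
sequence of code points.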

I wonder if it would be better to say that len(greek_five) is
undefined in Python. (And obviously slicing behavior follows from
len behavior.) There are excellent reasons for CPython to return
2 in the near future, but the far future is less clear. And
Jython and IronPython will be constrained by common sense to do
whatever their underlying platform does, even if that changes in
the future.

Designing these things would be a lot easier if we had a time
machine so we could go see how extended Unicode is used in practice
a decade or two from now.

Oh, wait....

-- Michael Chermside