[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 00:44:38 CEST 2011

On Wed, Aug 24, 2011 at 1:22 AM, Stephen J. Turnbull
<turnbull at sk.tsukuba.ac.jp> wrote:
> Well, no, it gives the right answer according to the design.  unicode
> objects do not contain character strings.  By design, they contain
> code point strings.  Guido has made that absolutely clear on a number
> of occasions.

Actually, the situation is that in narrow builds, they contain code
units (which may include surrogates); in wide builds they contain code
points. I think this is the crux of Tom Christiansen's complaints about
narrow builds.

Here's proof that narrow builds contain code units, not code points
(i.e. use UTF-16, not UCS-2):

$ ./python
Python 2.7.2+ (2.7:498b03a55297, Aug 25 2011, 15:14:01)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> a = u'\U00012345'
>>> a
u'\U00012345'
>>> len(a)
2
>>>

It's pretty clear that the interpreter is surrogate-aware, which to me
indicates the use of UTF-16.
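
For reference, the arithmetic that turns a code point above U+FFFF into
two UTF-16 code units can be sketched as below. This is only an
illustration of the standard UTF-16 surrogate formula, not code taken
from the interpreter; the helper name to_surrogates is made up for this
example.

```python
def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper for illustration; not part of CPython.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000                # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogates(0x12345)
print("%04X %04X" % (high, low))         # the two code units a narrow build stores
```

Two code units for one code point is why len(a) reports 2 on the narrow
build.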

Now in the PEP 393 branch:

$ ./python
Python 3.3.0a0 (pep-393:c60556059719, Aug 25 2011, 15:31:05)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> a = '\U00012345'
>>> a
'𒍅'
>>> len(a)
1
>>>

And some proof that this branch does not care about surrogates:

>>> a = '\ud808'
>>> b = '\udf45'
>>> a
'\ud808'
>>> b
'\udf45'
>>> a + b
'\ud808\udf45'
>>> len(a+b)
2
>>>

However:

>>> a = '\ud808\udf45'
>>> a
'𒍅'
>>> len(a)
1
>>>

Which to me merely shows it is smart when parsing string literals.
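
If you want the combining to happen explicitly rather than rely on what
the literal parser does, one way is a round trip through UTF-16. The
sketch below assumes Python 3.4 or later, where the 'surrogatepass'
error handler is accepted by the UTF-16 codecs; the helper name
join_surrogates is made up for this example.

```python
def join_surrogates(s):
    """Combine surrogate pairs in s into single code points.

    Hypothetical helper; assumes Python 3.4+, where 'surrogatepass'
    works with the utf-16 codecs.
    """
    # Encoding writes each surrogate as one 16-bit unit; decoding then
    # reads adjacent high/low units back as a single code point.
    raw = s.encode('utf-16-le', 'surrogatepass')
    return raw.decode('utf-16-le', 'surrogatepass')


pair = '\ud808' + '\udf45'               # built by concatenation, as above
joined = join_surrogates(pair)
print(len(joined), hex(ord(joined)))
```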

(I expect that regular 3.3 narrow builds behave similarly to the 2.7
narrow build, and 3.3 wide builds behave similarly to the pep-393 build;
I didn't have those lying around.)

-- 
--Guido van Rossum (python.org/~guido)