Re: [Python-Dev] PEP 393 Summer of Code Project

Aug. 25, 2011

On Wed, Aug 24, 2011 at 1:22 AM, Stephen J. Turnbull
<turnbull@sk.tsukuba.ac.jp> wrote:
...
> Well, no, it gives the right answer according to the design.  unicode
> objects do not contain character strings.  By design, they contain
> code point strings.  Guido has made that absolutely clear on a number
> of occasions.

Actually, the situation is that in narrow builds, they contain code
units (which may be surrogates); in wide builds they contain code
points. I think this is the crux of Tom Christiansen's complaints about
narrow builds.

Here's proof that narrow builds contain code units, not code points
(i.e. use UTF-16, not UCS-2):

$ ./python
Python 2.7.2+ (2.7:498b03a55297, Aug 25 2011, 15:14:01)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> a = u'\U00012345'
>>> a
u'\U00012345'
>>> len(a)
2

It's pretty clear that the interpreter is surrogate-aware, which to me
indicates the use of UTF-16.
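
As a rough sketch of the arithmetic behind that length of 2 (written
for a current Python 3 rather than the 2.7 build above; the helper name
to_surrogate_pair is made up for illustration):

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x12345)
print(hex(high), hex(low))

# Cross-check against the codec: 4 bytes, i.e. two 16-bit code units.
encoded = '\U00012345'.encode('utf-16-be')
print(len(encoded) // 2)
```

The pair this produces for U+12345 is the same \ud808 / \udf45 pair used
in the surrogate experiments in this message.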

Now in the PEP 393 branch:

$ ./python
Python 3.3.0a0 (pep-393:c60556059719, Aug 25 2011, 15:31:05)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> a = '\U00012345'
>>> a
'𒍅'
>>> len(a)
1

And some proof that this branch does not care about surrogates:

>>> a = '\ud808'
>>> b = '\udf45'
>>> a
'\ud808'
>>> b
'\udf45'
>>> a + b
'\ud808\udf45'
>>> len(a+b)
2

However:

>>> a = '\ud808\udf45'
>>> a
'𒍅'
>>> len(a)
1

Which to me merely shows it is smart when parsing string literals.

(I expect that regular 3.3 narrow builds behave similarly to the 2.7
narrow build, and 3.3 wide builds behave similarly to the pep-393 build;
I didn't have those lying around.)
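
For anyone who wants to compare code-point and code-unit counts on
whatever build they have, here is a minimal sketch. It assumes Python
3.4 or later, where the 'surrogatepass' error handler works with the
UTF-16 codecs; both helper names are hypothetical and not part of any
branch discussed here.

```python
def utf16_code_units(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode('utf-16-le', 'surrogatepass')) // 2

def join_surrogates(s):
    """Combine adjacent high/low surrogate code points into one code point.

    Lone surrogates are passed through unchanged.
    """
    raw = s.encode('utf-16-le', 'surrogatepass')
    return raw.decode('utf-16-le', 'surrogatepass')

pair = '\ud808' + '\udf45'       # two separate surrogate code points
joined = join_surrogates(pair)   # a single code point, U+12345
print(len(pair), len(joined), utf16_code_units(joined))
```

Round-tripping through UTF-16 is just one way to do the joining; it is
chosen here because it leans on the codec instead of hand-written
surrogate arithmetic.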

-- 
--Guido van Rossum (python.org/~guido)