
On Wed, Aug 24, 2011 at 1:22 AM, Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp> wrote:
> Well, no, it gives the right answer according to the design. unicode objects do not contain character strings. By design, they contain code point strings. Guido has made that absolutely clear on a number of occasions.
Actually, the situation is that in narrow builds, they contain code units (which may have surrogates); in wide builds they contain code points. I think this is the crux of Tom Christiansen's complaints about narrow builds.

Here's proof that narrow builds contain code units, not code points (i.e. use UTF-16, not UCS-2):

$ ./python
Python 2.7.2+ (2.7:498b03a55297, Aug 25 2011, 15:14:01)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> a = u'\U00012345'
>>> a
u'\U00012345'
>>> len(a)
2
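
For illustration only (this is not part of the original session): a minimal sketch of the standard UTF-16 surrogate arithmetic, showing why U+12345 occupies two code units. The helper name `to_surrogate_pair` is made up for this example, and the code assumes Python 3.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x12345)
print(hex(high), hex(low))

# Number of 16-bit code units needed to store the character in UTF-16.
code_units = len("\U00012345".encode("utf-16-le")) // 2
print(code_units)
```

The pair it computes is the same one used in the surrogate experiments later in this message.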
It's pretty clear that the interpreter is surrogate-aware, which to me indicates the use of UTF-16.

Now in the PEP 393 branch:

$ ./python
Python 3.3.0a0 (pep-393:c60556059719, Aug 25 2011, 15:31:05)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> a = '\U00012345'
>>> a
'𒍅'
>>> len(a)
1
And some proof that this branch does not care about surrogates:

>>> a = '\ud808'
>>> b = '\udf45'
>>> a
'\ud808'
>>> b
'\udf45'
>>> a + b
'\ud808\udf45'
>>> len(a+b)
2
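
As a side illustration (not from the original session): a rough sketch of how two separately created surrogates could be combined explicitly, by round-tripping through UTF-16 with the `surrogatepass` error handler. It assumes Python 3.3 or later, and `join_surrogates` is a hypothetical helper name.

```python
def join_surrogates(text):
    """Combine UTF-16 surrogate pairs in text into single code points.

    An unpaired surrogate makes the decode step raise UnicodeDecodeError.
    """
    # surrogatepass lets the encoder emit the surrogates as raw code units.
    raw = text.encode("utf-16-le", "surrogatepass")
    # A normal UTF-16 decode then reads each pair as one code point.
    return raw.decode("utf-16-le")


pair = "\ud808" + "\udf45"
joined = join_surrogates(pair)
print(len(pair), len(joined))
```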
However:

>>> a = '\ud808\udf45'
>>> a
'𒍅'
>>> len(a)
1
Which to me merely shows it is smart when parsing string literals.

(I expect that regular 3.3 narrow builds behave similarly to the 2.7 narrow build, and 3.3 wide builds behave similarly to the pep-393 build; I didn't have those lying around.)

--
--Guido van Rossum (python.org/~guido)