[Python-Dev] UCS2/UCS4 default

Thu Jul 3 17:31:57 CEST 2008

Hello,

2008/7/3 Guido van Rossum <guido at python.org>:
> I don't see an answer there to the question of whether the length()
> method of a Java String object containing a single surrogate pair
> returns 1 or 2; I suspect it returns 2. Python 3 supports things like
> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
> unichr and unicode literals.)

Python 2.6 support for supplementary characters is not ideal on a narrow build:
>>> unichr(0x2f81a)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>>> ord(u'\U0002F81A')
TypeError: ord() expected a character, but string of length 2 found.

A \Uxxxxxxxx escape seems to be the only way to enter these characters.
Python 3.0 is much better and passes both tests above.
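
For reference, here is a minimal sketch of the arithmetic involved when a
narrow build stores a supplementary character as a surrogate pair. The
function names to_surrogates and from_surrogates are hypothetical and only
illustrate the UTF-16 mapping; they are not part of any Python API.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x2F81A)
print(hex(high), hex(low))               # 0xd87e 0xdc1a
print(hex(from_surrogates(high, low)))   # 0x2f81a
```

An ord() that accepts a two-unit string on a narrow build would presumably
do something equivalent to from_surrogates above.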

The unicodedata module gives good results in both versions:
>>> unicodedata.name(u'\U0002F81A')
'CJK COMPATIBILITY IDEOGRAPH-2F81A'
>>> unicodedata.category(u'\U0002F81A')
'Lo'

With Python 3.0, I found only two places that refuse large code points
on narrow builds:
the "%c" format and the "C" format of Py_BuildValue(). Both should be fixed.
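
A quick way to check the "%c" case from Python code is sketched below. It
assumes an interpreter where the fix is in place (for instance Python 3.3 or
later, which no longer has separate narrow and wide builds); on an affected
narrow build the formatting line would be expected to raise instead.

```python
code_point = 0x2F81A

# "%c" accepts an integer code point and formats it as one character.
formatted = "%c" % code_point

# The result should match what chr() returns for the same code point.
assert formatted == chr(code_point)
print(ascii(formatted))
```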

> The one thing that may be missing from Python is things like
> interpretation of surrogates by functions like isalpha() and I'm okay
> with adding that (since those have to loop over the entire string
> anyway).

In this case, a new .isascii() method would be needed for some uses.
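
To illustrate the kind of use I have in mind, here is a rough sketch written
as plain functions. The names is_ascii and is_ascii_alpha are hypothetical
helpers, not existing methods; the point is only that isalpha() alone cannot
tell ASCII letters from other letters.

```python
def is_ascii(text):
    """Return True if every code point in text is below 128."""
    return all(ord(ch) < 128 for ch in text)


def is_ascii_alpha(text):
    """Return True for a non-empty string made only of ASCII letters."""
    return text.isalpha() and is_ascii(text)


print("abc".isalpha(), is_ascii_alpha("abc"))            # True True
print("\xe9".isalpha(), is_ascii_alpha("\xe9"))          # True False
```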

-- 
Amaury Forgeot d'Arc