[Python-Dev] UCS2/UCS4 default

Thu Jul 3 17:03:55 CEST 2008

On Thu, Jul 3, 2008 at 7:46 AM, Jeroen Ruigrok van der Werven
<asmodai at in-nomine.org> wrote:
> -On [20080703 15:58], Guido van Rossum (guido at python.org) wrote:
>>Your seem to be suggesting that len(u"\U00012345") should return 1 on
>>a system that internally uses UTF-16 and hence represents this string
>>as a surrogate pair.
>
> From a Unicode and UTF-16 point of view that makes the most sense. So yes, I
> am suggesting that.
>
>>This is not going to happen. You may as well complain to the authors
>>of the Java standard about the corresponding problem there.
>
> Why would I need to complain to them? They already fixed it since 1.5.0.
>
> Java 1.5.0's release notes
> (http://java.sun.com/developer/technicalArticles/releases/j2se15/):
>
> Supplementary Character Support
>
> 32-bit supplementary character support has been carefully added to the
> platform as part of the transition to Unicode 4.0 support. Supplementary
> characters are encoded as a special pair of UTF16 values to generate a
> different character, or codepoint. A surrogate pair is a combination of a
> high UTF16 value and a following low UTF16 value. The high and low values
> are from a special range of UTF16 values.
>
> In general, when using a String or sequence of characters, the core API
> libraries will transparently handle the new supplementary characters for
> you.
>
> See also http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html
>
> The methods that accept an int value support all Unicode characters,
> including supplementary characters. For example, Character.isLetter(0x2F81A)
> returns true because the code point value represents a letter (a CJK
> ideograph).

I don't see an answer there to the question of whether the length()
method of a Java String object containing a single surrogate pair
returns 1 or 2; I suspect it returns 2. Python 3 supports things like
chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
unichr and unicode literals.)

The one thing that may be missing from Python is things like
interpretation of surrogates by functions like isalpha() and I'm okay
with adding that (since those have to loop over the entire string
anyway).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)