[Python-Dev] UCS2/UCS4 default

Thu Jul 3 16:46:48 CEST 2008

-On [20080703 15:58], Guido van Rossum (guido at python.org) wrote:
>Your seem to be suggesting that len(u"\U00012345") should return 1 on
>a system that internally uses UTF-16 and hence represents this string
>as a surrogate pair.

>From a Unicode and UTF-16 point of view that makes the most sense. So yes, I
am suggesting that.

>This is not going to happen. You may as well complain to the authors
>of the Java standard about the corresponding problem there.

Why would I need to complain to them? They already fixed it since 1.5.0.

Java 1.5.0's release notes
(http://java.sun.com/developer/technicalArticles/releases/j2se15/):

Supplementary Character Support

32-bit supplementary character support has been carefully added to the
platform as part of the transition to Unicode 4.0 support. Supplementary
characters are encoded as a special pair of UTF16 values to generate a
different character, or codepoint. A surrogate pair is a combination of a
high UTF16 value and a following low UTF16 value. The high and low values
are from a special range of UTF16 values.

In general, when using a String or sequence of characters, the core API
libraries will transparently handle the new supplementary characters for
you.

See also http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html

The methods that accept an int value support all Unicode characters,
including supplementary characters. For example, Character.isLetter(0x2F81A)
returns true because the code point value represents a letter (a CJK
ideograph).

-- 
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Life can only be understood backwards, but it must be lived forwards...