[Python-Dev] UCS2/UCS4 default

Jeroen Ruigrok van der Werven asmodai at in-nomine.org
Thu Jul 3 18:51:40 CEST 2008


-On [20080703 17:03], Guido van Rossum (guido at python.org) wrote:
>I don't see an answer there to the question of whether the length()
>method of a Java String object containing a single surrogate pair
>returns 1 or 2; I suspect it returns 2.

As
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/CharSequence.html#length()
states:

int length()

Returns the length of this character sequence. The length is the number of
16-bit chars in the sequence. 

So a single surrogate pair has a length of 2. But when Java switched to full
UTF-16 support in 1.5.0, they extended the API rather than change the
existing methods, which had probably become too ingrained.

E.g. codePointCount()
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#codePointCount(char[],%20int,%20int)
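
To make the difference between the two counts concrete, here is a minimal
Python sketch. The helper names utf16_length and code_point_count are made up
for this illustration, and it assumes Python 3.3 or later, where len() on a
str counts code points. On a narrow (UCS2) build of the kind under discussion
here, len() would report 2 for this string, just like Java's length().

```python
def utf16_length(s):
    """Count 16-bit code units, the quantity Java's length() reports."""
    # Each UTF-16 code unit is exactly two bytes in the encoded form.
    return len(s.encode("utf-16-le")) // 2


def code_point_count(s):
    """Count Unicode code points, as codePointCount() does.

    Assumes Python 3.3+, where len() on a str counts code points.
    """
    return len(s)


# U+10400 lies outside the BMP, so UTF-16 stores it as one surrogate pair.
s = "\U00010400"
print(utf16_length(s), code_point_count(s))
```

The two functions agree for any string made up only of BMP characters and
differ by one for every surrogate pair.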

>The one thing that may be missing from Python is things like
>interpretation of surrogates by functions like isalpha() and I'm okay
>with adding that (since those have to loop over the entire string
>anyway).

Those would be welcome already, yes. I'll see if I can help out.
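
As a rough idea of what surrogate-aware behaviour could look like, here is a
sketch only, not a proposed patch. It works on a list of UTF-16 code units,
as a narrow build would store them. The names to_utf16_units,
iter_code_points and isalpha_utf16 are hypothetical, and treating "alphabetic"
as "general category starts with L" is an assumption of this sketch.

```python
import unicodedata


def to_utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    raw = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def iter_code_points(units):
    """Yield code points, combining each valid surrogate pair into one."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character, or a lone surrogate passed through unchanged.
            yield u
            i += 1


def isalpha_utf16(units):
    """True if non-empty and every code point is a letter (category L*)."""
    cps = list(iter_code_points(units))
    return bool(cps) and all(
        unicodedata.category(chr(cp)).startswith("L") for cp in cps
    )


print(isalpha_utf16(to_utf16_units("a\U00010400")))
```

A lone surrogate has category Cs, so it makes the check fail instead of being
mistaken for a letter. The function loops over the whole sequence once, which
matches the point that such methods have to scan the entire string anyway.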

-- 
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Fallen into ever-mourn, with these wings so torn, after your day my dawn...

