[Python-Dev] UCS2/UCS4 default

Thu Jul 3 18:35:29 CEST 2008

Paul Moore wrote:
> On 03/07/2008, Guido van Rossum <guido at python.org> wrote:
>> I don't see an answer there to the question of whether the length()
>> method of a Java String object containing a single surrogate pair
>> returns 1 or 2; I suspect it returns 2.
> 
> It appears you're right:
> 
>> type testucs.java
> class testucs {
>     public static void main(String[] args) {
>         StringBuilder s = new StringBuilder("Hello, ");
>         s.appendCodePoint(0x2F81A);
>         System.out.println(s); // Display the string.
>         System.out.println(s.length());
>     }
> }
> 
>> java testucs
> Hello, ?
> 9
> 
>> java -version
> java version "1.6.0_05"
> Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
> Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)
> 
>> Python 3 supports things like
>> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
>> unichr and unicode literals.)
> 
> And Java doesn't appear to - that appendCodePoint() method was
> wonderfully hard to find :-)
> 
There's also the issue of indexing Unicode strings. If we are going 
to insist that len(u) counts a surrogate pair as one character, then 
random access to the characters of a string becomes an extremely 
inefficient operation: with a UTF-16 representation, finding the i-th 
character means scanning the string from the start.
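
To illustrate the cost, here is a minimal sketch, assuming the string is
stored as a sequence of UTF-16 code units (simulated below with a plain
list of ints). The helper names utf16_units and code_point_at are made up
for this example; they are not part of any Python API.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (illustrative helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_point_at(units, index):
    """Return the index-th code point, counting a surrogate pair as one character.

    The position of character `index` is unknown until every earlier
    code unit has been inspected, so the lookup is a linear scan.
    """
    i = 0
    remaining = index
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine the high and low surrogates into a single code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            cp = unit
            width = 1
        if remaining == 0:
            return cp
        remaining -= 1
        i += width
    raise IndexError("code point index out of range")


units = utf16_units("Hello, \U0002F81A!")
```

Here code_point_at(units, 8) has to step over the surrogate pair at
positions 7 and 8 before it can find the "!", so the lookup is O(n)
rather than the O(1) we get when every index maps to one fixed-width unit.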

Surely it's desirable under all circumstances that

   len(u) == sum(1 for c in u)

and that

   [c for c in u] == [u[i] for i in range(len(u))]
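
Both properties are easy to check mechanically. The following is only a
sketch; check_invariants is a hypothetical helper name, and the value
len() reports for the sample string depends on how the interpreter
stores non-BMP characters.

```python
def check_invariants(u):
    """Return True if len(), iteration, and indexing all agree for u."""
    # The length must equal the number of items produced by iteration.
    same_length = len(u) == sum(1 for c in u)
    # Iteration must yield the same items as indexing 0 .. len(u) - 1.
    same_items = [c for c in u] == [u[i] for i in range(len(u))]
    return same_length and same_items


# A string containing one character outside the Basic Multilingual Plane.
sample = "Hello, \U0002F81A"
```

Any scheme where len() counts code points but u[i] still indexes code
units would make check_invariants(sample) return False.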

How would that play under Jeroen's proposed change?

regards
  Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/