[Python-Dev] UTF-16 code point comparison

Finn Bock bckfnn@worldonline.dk
Thu, 27 Jul 2000 18:01:01 GMT


[M.-A. Lemburg]

>BTW, does Java support UCS-4 ? If not, then Java is wrong
>here ;-)

Java claims to use Unicode 2.1 [*]. I couldn't locate anything describing
whether this means UCS-2 or UTF-16. I think Unicode 2.1 includes UCS-4. The
actual level of support for UCS-4 is probably debatable.

- The builtin char is 16 bits wide and obviously cannot support UCS-4.
- The Character class can report if a character is a surrogate:
    >>> from java.lang import Character
    >>> Character.getType("\ud800") == Character.SURROGATE
    1
- As reported, direct string comparison ignores surrogates.
- The BreakIterator does not handle surrogates. It does handle 
  combining characters and it seems a natural place to put support
  for surrogates.
- The Collator class offers different levels of normalization before
  comparing strings but does not seem to support surrogates. This class
  seems a natural place for JavaSoft to put support for surrogates
  during string comparison.
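
To illustrate what "ignoring surrogates" means for comparison, here is a
rough sketch in plain Python 3 (not Jython, and not how the JDK does it).
It assumes a str holds code points and encodes to UTF-16 to get the 16-bit
units a Java char would see. The helper names are made up for this example.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def compare_code_units(a, b):
    """Compare unit by unit, the way a 16-bit-char comparison would."""
    ua, ub = utf16_code_units(a), utf16_code_units(b)
    return (ua > ub) - (ua < ub)


def compare_code_points(a, b):
    """Compare by code point, treating a surrogate pair as one character."""
    pa, pb = [ord(c) for c in a], [ord(c) for c in b]
    return (pa > pb) - (pa < pb)


bmp_char = "\uff5e"            # U+FF5E, a BMP character above the surrogate range
supplementary = "\U00010000"   # U+10000, stored in UTF-16 as D800 DC00

# Code unit order puts U+FF5E after U+10000, because 0xFF5E > 0xD800.
unit_order = compare_code_units(bmp_char, supplementary)

# Code point order puts U+FF5E before U+10000.
point_order = compare_code_points(bmp_char, supplementary)
```

The two orderings only disagree when one string has a character in
U+E000..U+FFFF at the position where the other has a surrogate pair.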


These findings are gleaned from the sources of JDK 1.3.

[*]
http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.html#25310

regards,
finn