[Python-Dev] UTF-16 code point comparison
Finn Bock
bckfnn@worldonline.dk
Thu, 27 Jul 2000 18:01:01 GMT
[M.-A. Lemburg]
>BTW, does Java support UCS-4 ? If not, then Java is wrong
>here ;-)
Java claims to use Unicode 2.1 [*]. I couldn't locate anything describing
whether this means UCS-2 or UTF-16. I think Unicode 2.1 includes UCS-4. The
actual level of support for UCS-4 is probably debatable.
- The built-in char is 16 bits wide and obviously cannot support UCS-4.
- The Character class can report if a character is a surrogate:
>>> from java.lang import Character
>>> Character.getType("\ud800") == Character.SURROGATE
1
- As reported, direct string comparison ignores surrogates.
- The BreakIterator does not handle surrogates. It does handle
combining characters and it seems a natural place to put support
for surrogates.
- The Collator class offers different levels of normalization before
comparing strings but does not seem to support surrogates. This class
seems a natural place for JavaSoft to put support for surrogates
during string comparison.
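To illustrate what "ignores surrogates" means for comparison, here is a
minimal sketch in plain Python 3 (not Jython, and not taken from the JDK
sources). It assumes a Python whose str type compares by code point; the
helper names utf16_units and is_surrogate are made up for this example.
It compares U+FFFF with U+10000 once by code point and once by raw 16-bit
code units, which is roughly what a comparison over 16-bit chars does.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


bmp = "\uffff"         # last code point of the Basic Multilingual Plane
astral = "\U00010000"  # first code point that needs a surrogate pair

# Python 3 compares str objects by code point: U+FFFF sorts before U+10000.
code_point_less = bmp < astral

# Comparing raw code units treats the surrogate pair as two 16-bit values,
# and the high surrogate 0xD800 is smaller than 0xFFFF, so the order flips.
code_unit_less = utf16_units(bmp) < utf16_units(astral)

print([hex(u) for u in utf16_units(astral)])
print(code_point_less, code_unit_less)
```

The two results disagree, which is the mismatch under discussion: a
comparison that works on 16-bit units sorts every character outside the
BMP before U+E000..U+FFFF.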
These findings are gleaned from the sources of JDK 1.3.
[*]
http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.html#25310
regards,
finn