[Python-Dev] UTF-16 code point comparison

M.-A. Lemburg mal@lemburg.com
Thu, 27 Jul 2000 22:22:09 +0200


Finn Bock wrote:
> 
> [M.-A. Lemburg]
> 
> >BTW, does Java support UCS-4 ? If not, then Java is wrong
> >here ;-)
> 
> Java claims to use Unicode 2.1 [*]. I couldn't locate anything describing
> whether this is UCS-2 or UTF-16. I think Unicode 2.1 includes UCS-4. The
> actual level of support for UCS-4 is probably debatable.
> 
> - The builtin char is 16bit wide and can obviously not support UCS-4.
> - The Character class can report if a character is a surrogate:
>     >>> from java.lang import Character
>     >>> Character.getType("\ud800") == Character.SURROGATE
>     1

>>> import unicodedata
>>> unicodedata.category(u'\ud800')
'Cs'

... which means the same thing, only in Unicode 3 standard
notation.

Makes me think: perhaps we should add the Java constants to
the unicodedata database. Is there a list of those available
somewhere ?
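
For illustration, a sketch of what such a table might look like,
going by the SURROGATE example above plus a few other constant names
from java.lang.Character. JAVA_TYPE_NAMES and java_type_name() are
names made up for this example and are not part of unicodedata; the
table is deliberately incomplete.

```python
import unicodedata

# Partial, illustrative mapping from Unicode general category codes
# to the corresponding java.lang.Character constant names.
JAVA_TYPE_NAMES = {
    'Lu': 'UPPERCASE_LETTER',
    'Ll': 'LOWERCASE_LETTER',
    'Nd': 'DECIMAL_DIGIT_NUMBER',
    'Zs': 'SPACE_SEPARATOR',
    'Cs': 'SURROGATE',
    'Co': 'PRIVATE_USE',
}

def java_type_name(ch):
    """Hypothetical helper: return the Java-style constant name for ch,
    or None if its category is not in the partial table above."""
    return JAVA_TYPE_NAMES.get(unicodedata.category(ch))
```

With this, java_type_name(u'\ud800') gives the Java-style name for the
same 'Cs' category shown above.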

> - As reported, direct string comparison ignores surrogates.

I would guess that this'll have to change as soon as JavaSoft
folks realize that they need to handle UTF-16 and not only
UCS-2.
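
To make the difference concrete, here is a minimal sketch, assuming
a current Python 3 where str comparison works on code points. The
helper utf16_units() is a name made up for this example. It compares
the same two strings once by code point and once by raw 16-bit code
unit, which is what a UCS-2 style comparison sees.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode('utf-16-be')
    return [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]

a = u'\uffff'        # highest code point in the BMP
b = u'\U00010000'    # first supplementary code point (a surrogate pair in UTF-16)

# Code point order: U+FFFF sorts before U+10000.
by_code_point = a < b

# Code unit order: 0xFFFF is compared against the high surrogate 0xD800,
# so the supplementary character sorts first instead.
by_code_unit = utf16_units(a) < utf16_units(b)
```

The two orderings disagree for exactly this kind of pair: a BMP
character above the surrogate range against any supplementary
character.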

> - The BreakIterator does not handle surrogates. It does handle
>   combining characters and it seems a natural place to put support
>   for surrogates.

What is a BreakIterator ? An iterator to scan line/word breaks ?

> - The Collator class offers different levels of normalization before
>   comparing strings but does not seem to support surrogates. This class
>   seems a natural place for JavaSoft to put support for surrogates
>   during string comparison.

We'll need something like this for 2.1 too: first some
standard APIs for normalization and then a few unicmp()
APIs to use for sorting.
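
A rough sketch of the idea, assuming a Python whose unicodedata
module provides normalize(). The unicmp() below is only a
hypothetical shape for such an API, not an existing function: it
normalizes both strings to the same form and then does a plain
three-way comparison.

```python
import unicodedata

def unicmp(a, b, form='NFC'):
    """Hypothetical helper: compare a and b after normalizing both.

    Returns -1, 0 or 1 in the style of the classic cmp().
    """
    na = unicodedata.normalize(form, a)
    nb = unicodedata.normalize(form, b)
    return (na > nb) - (na < nb)

composed = u'\u00e4'       # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = u'a\u0308'    # 'a' followed by COMBINING DIAERESIS

# The raw strings differ, but they compare equal once normalized.
raw_equal = composed == decomposed
normalized_result = unicmp(composed, decomposed)
```

This only settles canonical equivalence; it says nothing yet about
locale-specific ordering.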

We might even have to add collation sequences somewhere because
this is a locale issue as well. Sometimes it's even worse, with
different strategies used for different tasks within one
locale, e.g. in Germany we sometimes sort the Umlaut ä as "ae"
and at other times as a separate character.
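
As a sketch of how the two strategies diverge, here are two sort
keys, assuming lowercase-insensitive sorting and a simplified
alphabet. All names (EXPANSIONS, ALPHABET, phonebook_key,
extra_letter_key) are made up for this example, and placing each
Umlaut directly after its base letter is just one possible
convention for the "separate character" case.

```python
# Strategy 1: expand Umlauts (and sharp s) before comparing.
EXPANSIONS = {u'\u00e4': u'ae', u'\u00f6': u'oe',
              u'\u00fc': u'ue', u'\u00df': u'ss'}

def phonebook_key(word):
    """Sort key that treats ä as "ae", ö as "oe" and ü as "ue"."""
    return u''.join(EXPANSIONS.get(ch, ch) for ch in word.lower())

# Strategy 2: treat each Umlaut as its own letter after the base letter.
ALPHABET = u'a\u00e4bcdefghijklmno\u00f6pqrstu\u00fcvwxyz'

def extra_letter_key(word):
    """Sort key that ranks characters by their position in ALPHABET.

    Characters not in ALPHABET sort last, ordered by code point.
    """
    key = []
    for ch in word.lower():
        pos = ALPHABET.find(ch)
        key.append((pos if pos >= 0 else len(ALPHABET), ch))
    return key

words = [u'\u00e4rger', u'adler', u'afrika']

phonebook_order = sorted(words, key=phonebook_key)
extra_letter_order = sorted(words, key=extra_letter_key)
```

With the first key, "ärger" lands between "adler" and "afrika"; with
the second it comes after both.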
 
> These findings are gleaned from the sources of JDK1.3
> 
> [*]
> http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.html#25310
> 

Thanks for the info,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/