[Python-Dev] UTF-16 code point comparison

M.-A. Lemburg mal@lemburg.com
Thu, 27 Jul 2000 22:22:09 +0200

Finn Bock wrote:
> [M.-A. Lemburg]
> >BTW, does Java support UCS-4 ? If not, then Java is wrong
> >here ;-)
> Java claims to use Unicode 2.1 [*]. I couldn't locate anything describing
> whether this is UCS-2 or UTF-16. I think Unicode 2.1 includes UCS-4. The
> actual level of support for UCS-4 is probably debatable.
> - The builtin char is 16 bits wide and obviously cannot support UCS-4.
> - The Character class can report if a character is a surrogate:
>     >>> from java.lang import Character
>     >>> Character.getType("\ud800") == Character.SURROGATE
>     1

>>> unicodedata.category(u'\ud800')
'Cs'

... which means the same thing, only in Unicode 3 standard notation.
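
A minimal sketch of the same surrogate test as a helper, assuming a
current Python 3 interpreter. The function names are hypothetical and
not part of unicodedata; only unicodedata.category() is a real API here.

```python
import unicodedata


def is_surrogate(ch):
    """Return True if the single character ch is a surrogate code point."""
    # 'Cs' is the Unicode general category "Other, surrogate".
    return unicodedata.category(ch) == "Cs"


def is_high_surrogate(ch):
    """Return True if ch lies in the high (leading) surrogate range."""
    # Range check by code point value; hypothetical helper for illustration.
    return 0xD800 <= ord(ch) <= 0xDBFF


print(is_surrogate("\ud800"), is_high_surrogate("\ud800"))
```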

Makes me think: perhaps we should add the Java constants to
the unicodedata database. Is there a list of those available
somewhere ?

> - As reported, direct string comparison ignores surrogates.

I would guess that this'll have to change as soon as JavaSoft
folks realize that they need to handle UTF-16 and not only UCS-2.
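
To illustrate why this matters, here is a rough sketch, assuming a
current Python 3 interpreter, of how comparing by 16-bit code units can
disagree with comparing by code points. The helper names are made up for
this example; the two sample characters are U+FF5E (in the BMP) and
U+10000 (encoded as the surrogate pair D800 DC00 in UTF-16).

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    # "surrogatepass" lets lone surrogates through instead of raising.
    data = s.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def cmp_code_units(a, b):
    """Compare the way a naive 16-bit string comparison would."""
    ua, ub = utf16_units(a), utf16_units(b)
    return (ua > ub) - (ua < ub)


def cmp_code_points(a, b):
    """Compare by code point (what Python 3 str comparison does)."""
    return (a > b) - (a < b)


bmp = "\uff5e"          # a BMP character above the surrogate range
astral = "\U00010000"   # first character outside the BMP

print(cmp_code_points(bmp, astral))  # by code point, bmp sorts first
print(cmp_code_units(bmp, astral))   # by code unit, 0xD800 < 0xFF5E flips it
```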

> - The BreakIterator does not handle surrogates. It does handle
>   combining characters and it seems a natural place to put support
>   for surrogates.

What is a BreakIterator ? An iterator to scan line/word breaks ?

> - The Collator class offers different levels of normalization before
>   comparing strings but does not seem to support surrogates. This class
>   seems a natural place for JavaSoft to put support for surrogates
>   during string comparison.

We'll need something like this for 2.1 too: first some
standard APIs for normalization and then a few unicmp()
APIs to use for sorting.
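
One possible shape for such an API, sketched against today's Python 3
and its unicodedata.normalize() function. The name unicmp() is taken
from the text above, but its signature and behaviour here are only
assumptions for illustration, not an existing API.

```python
import unicodedata
from functools import cmp_to_key


def unicmp(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form.

    Hypothetical helper: returns -1, 0 or 1 like the classic cmp().
    """
    na = unicodedata.normalize(form, a)
    nb = unicodedata.normalize(form, b)
    return (na > nb) - (na < nb)


# "e" + COMBINING ACUTE ACCENT versus the precomposed U+00E9
decomposed = "e\u0301"
precomposed = "\u00e9"

print(decomposed == precomposed)        # raw comparison sees two strings
print(unicmp(decomposed, precomposed))  # normalized comparison sees one

# Usable for sorting via cmp_to_key; the order is still plain code point
# order after normalization, not a locale-aware collation.
words = sorted(["b", "a\u0308", "a"], key=cmp_to_key(unicmp))
```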

We might even have to add collation sequences somewhere because
this is a locale issue as well... sometimes it's even worse
with different strategies used for different tasks within one
locale, e.g. in Germany we sometimes sort the Umlaut ä as "ae"
and at other times as an extra character.
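
A small sketch of the two strategies as sort keys, assuming Python 3.
Both key functions and the toy alphabet are hypothetical: real collation
rules cover far more than this, and placing "ä" directly after "a" is
just one way to read "extra character".

```python
def key_expand(word):
    """Sort key that treats the umlaut U+00E4 as the two letters 'ae'."""
    return word.lower().replace("\u00e4", "ae")


# Toy alphabet with U+00E4 as a separate letter right after "a".
ALPHABET = "a\u00e4bcdefghijklmnopqrstuvwxyz"


def key_separate(word):
    """Sort key that treats U+00E4 as its own letter.

    Only handles the letters in ALPHABET; anything else raises ValueError.
    """
    return [ALPHABET.index(c) for c in word.lower()]


words = ["B\u00e4r", "Bad", "Bauer"]
as_ae = sorted(words, key=key_expand)       # umlaut expanded to "ae"
as_extra = sorted(words, key=key_separate)  # umlaut as an extra letter
print(as_ae)
print(as_extra)
```

The same word list comes out in two different orders, which is why a
single comparison function per locale would not be enough.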

> These findings are gleaned from the sources of JDK1.3
>
> [*] http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.html#25310

Thanks for the info,
Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/