[Python-Dev] UTF-16 code point comparison
M.-A. Lemburg
mal@lemburg.com
Thu, 27 Jul 2000 22:22:09 +0200
Finn Bock wrote:
>
> [M.-A. Lemburg]
>
> >BTW, does Java support UCS-4 ? If not, then Java is wrong
> >here ;-)
>
> Java claims to use Unicode 2.1 [*]. I couldn't locate anything describing if
> this is UCS-2 or UTF-16. I think Unicode 2.1 includes UCS-4. The actual
> level of support for UCS-4 is probably debatable.
>
> - The builtin char is 16bit wide and can obviously not support UCS-4.
> - The Character class can report if a character is a surrogate:
> >>> from java.lang import Character
> >>> Character.getType("\ud800") == Character.SURROGATE
> 1
>>> unicodedata.category(u'\ud800')
'Cs'
... which means the same thing, only in Unicode 3 standard
notation.
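
As a rough sketch, the same check can be wrapped in a small helper. This assumes present-day Python 3, where unicodedata accepts ordinary str objects; the name is_surrogate is made up for illustration and is not an existing API.

```python
import unicodedata

def is_surrogate(ch):
    """Return True if ch is a surrogate code point (general category 'Cs').

    Hypothetical helper, not part of unicodedata.
    """
    return unicodedata.category(ch) == "Cs"

print(is_surrogate("\ud800"))  # a high surrogate
print(is_surrogate("\udc00"))  # a low surrogate
print(is_surrogate("A"))       # an ordinary letter
```

This mirrors the Character.getType() comparison quoted above, only using the category string instead of a named constant.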
Makes me think: perhaps we should add the Java constants to
the unicodedata database. Is there a list of those available
somewhere ?
> - As reported, direct string comparison ignores surrogates.
I would guess that this'll have to change as soon as JavaSoft
folks realize that they need to handle UTF-16 and not only
UCS-2.
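
To make the difference concrete, here is a minimal sketch of how the two orderings can disagree. It assumes present-day Python 3, where str comparison goes by code point; utf16_units is a made-up helper that stands in for what a comparison over 16-bit units would see.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

a = "\uff5e"      # U+FF5E, a single code unit 0xFF5E
b = "\U00010000"  # U+10000, stored as the surrogate pair 0xD800 0xDC00

# Python 3 compares str objects code point by code point.
code_point_order = sorted([a, b])

# Comparing the raw 16-bit units puts the surrogate pair first,
# because 0xD800 < 0xFF5E.
code_unit_order = sorted([a, b], key=utf16_units)

print(code_point_order == [a, b])
print(code_unit_order == [b, a])
```

Only characters in the range U+E000 to U+FFFF are affected, since those are the BMP code points whose code units sort above the surrogates.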
> - The BreakIterator does not handle surrogates. It does handle
> combining characters and it seems a natural place to put support
> for surrogates.
What is a BreakIterator ? An iterator to scan line/word breaks ?
> - The Collator class offers different levels of normalization before
> comparing strings but does not seem to support surrogates. This class
> seems a natural place for javasoft to put support for surrogates
> during string comparison.
We'll need something like this for Python 2.1 too: first some
standard APIs for normalization and then a few unicmp()
APIs to use for sorting.
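
A minimal sketch of what such a pairing could look like, assuming present-day Python, where unicodedata.normalize() exists. The signature and the choice of NFC as default are assumptions made for illustration, not a proposed design.

```python
import unicodedata

def unicmp(a, b, form="NFC"):
    """Compare two strings after normalization; return -1, 0 or 1.

    Hypothetical signature: normalizes both sides to the same form,
    then falls back to plain code point comparison.
    """
    na = unicodedata.normalize(form, a)
    nb = unicodedata.normalize(form, b)
    return (na > nb) - (na < nb)

composed = "\u00e4"     # ä as one precomposed code point
decomposed = "a\u0308"  # a followed by COMBINING DIAERESIS

print(composed == decomposed)        # plain comparison sees two different strings
print(unicmp(composed, decomposed))  # normalized comparison treats them as equal
```

Normalization only makes equivalent spellings compare equal; it does not by itself produce a locale-aware sort order.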
We might even have to add collation sequences somewhere because
this is a locale issue as well... sometimes it's even worse
with different strategies used for different tasks within one
locale, e.g. in Germany we sometimes sort the umlaut ä as "ae"
and at other times as an extra character.
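
The two strategies can be sketched as two sort keys. Both tables are illustrative assumptions, not taken from any standard, and "extra character" is read here as "a separate letter placed right after its base vowel".

```python
def key_expand(word):
    """Sort key that treats umlauts as their two-letter expansions (ä as "ae")."""
    table = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
    return "".join(table.get(ch, ch) for ch in word.lower())

def key_extra(word):
    """Sort key that treats each umlaut as a separate letter after its base vowel."""
    # (base, 1) sorts after every (base, 0), so "ä" follows all plain "a" entries.
    table = {"ä": ("a", 1), "ö": ("o", 1), "ü": ("u", 1)}
    return [table.get(ch, (ch, 0)) for ch in word.lower()]

words = ["Ärger", "Affe", "Azur", "Adler"]

print(sorted(words, key=key_expand))  # Ärger lands between Adler and Affe
print(sorted(words, key=key_extra))   # Ärger lands after Azur
```

The same list comes out in two different orders, which is why a single built-in comparison would not be enough and some notion of selectable collation sequences seems necessary.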
> These findings are gleaned from the sources of JDK1.3
>
> [*]
> http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.html#25310
>
Thanks for the info,
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/