[Python-Dev] UTF-16 code point comparison
Fri, 28 Jul 2000 05:15:17 GMT
>Finn Bock wrote:
>> [M.-A. Lemburg]
>> >BTW, does Java support UCS-4 ? If not, then Java is wrong
>> >here ;-)
>> Java claims to use Unicode 2.1 [*]. I couldn't locate anything describing
>> whether this is UCS-2 or UTF-16. I think Unicode 2.1 includes UCS-4. The
>> actual level of support for UCS-4 is probably debatable.
>> - The builtin char is 16 bits wide and obviously cannot support UCS-4.
>> - The Character class can report if a character is a surrogate:
>> >>> from java.lang import Character
>> >>> Character.getType("\ud800") == Character.SURROGATE
>... which means the same thing only in Unicode3 standards
>Makes me think: perhaps we should add the Java constants to
>unicodedata base. Is there a list of those available
UNASSIGNED = 0
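
For comparison, here is a minimal sketch of the same surrogate test done
with CPython's unicodedata module. It assumes a Python 3 interpreter, where
general categories are reported as two-letter strings ("Cs" for surrogates,
"Cn" for unassigned) rather than as integer constants. The helper names
is_surrogate and is_unassigned are made up for this illustration.

```python
import unicodedata


def is_surrogate(ch):
    """True if ch is a surrogate code point (general category "Cs")."""
    return unicodedata.category(ch) == "Cs"


def is_unassigned(ch):
    """True if ch has general category "Cn" (unassigned or noncharacter)."""
    return unicodedata.category(ch) == "Cn"


print(is_surrogate("\ud800"))   # a lone high surrogate
print(is_surrogate("A"))
print(is_unassigned("\uffff"))  # U+FFFF is a noncharacter
```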
>> - As reported, direct string comparison ignores surrogates.
>I would guess that this'll have to change as soon as JavaSoft
>folks realize that they need to handle UTF-16 and not only
Predicting the future can be difficult, but here is my take:
JavaSoft will never change the way String.compareTo works.
String.compareTo is documented as:
Compares two strings lexicographically. The comparison is based on
the Unicode value of each character in the strings. ...
Instead they will mark it as a very naive string comparison and suggest
that users use the Collator classes for anything but the simplest cases.
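
To make the difference concrete, here is a small sketch of the two
orderings under discussion. It is not Java's implementation; it assumes
Python 3, where str comparison goes by code point, and it emulates a
comparison of 16-bit values by encoding to UTF-16 first. The function
names are hypothetical.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def compare_code_units(a, b):
    """Compare by 16-bit code unit; returns -1, 0 or 1."""
    ua, ub = utf16_units(a), utf16_units(b)
    return (ua > ub) - (ua < ub)


def compare_code_points(a, b):
    """Compare by code point (what Python 3 str comparison does)."""
    return (a > b) - (a < b)


bmp = "\uff5e"         # a BMP character above the surrogate range
astral = "\U00010000"  # encoded in UTF-16 as the pair D800 DC00

print(compare_code_units(bmp, astral))   # 1: 0xFF5E sorts after 0xD800
print(compare_code_points(bmp, astral))  # -1: U+FF5E sorts before U+10000
```

The two orders only disagree when a character from U+E000..U+FFFF is
compared against one outside the BMP.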
>> - The BreakIterator does not handle surrogates. It does handle
>> combining characters and it seems a natural place to put support
>> for surrogates.
>What is a BreakIterator ? An iterator to scan line/word breaks ?
Yes, as well as character breaks. It already contains the framework for
marking two chars next to each other as one.
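
As an illustration of what surrogate support at the character-break level
could look like, here is a minimal sketch that finds character boundaries
in a sequence of UTF-16 code units and keeps a high/low surrogate pair
together. It is not the BreakIterator algorithm, it ignores combining
characters, and the name char_boundaries is invented for this example.

```python
def char_boundaries(units):
    """Return boundary offsets into a list of UTF-16 code units.

    A high surrogate (D800..DBFF) directly followed by a low surrogate
    (DC00..DFFF) is treated as a single character.
    """
    boundaries = [0]
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        boundaries.append(i)
    return boundaries


# "A", U+10000 as a surrogate pair, "B"
print(char_boundaries([0x0041, 0xD800, 0xDC00, 0x0042]))  # [0, 1, 3, 4]
```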
>> - The Collator class offers different levels of normalization before
>> comparing strings but does not seem to support surrogates. This class
>> seems a natural place for JavaSoft to put support for surrogates
>> during string comparison.
>We'll need something like this for 2.1 too: first some
>standard APIs for normalization and then a few unicmp()
>APIs to use for sorting.
>We might even have to add collation sequences somewhere because
>this is a locale issue as well... sometimes it's even worse
>with different strategies used for different tasks within one
>locale, e.g. in Germany we sometimes sort the Umlaut ä as "ae"
>and at other times as an extra character.
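
A rough sketch of the "ae" strategy next to plain code point order, in
Python 3. The expansion table and the name phonebook_key are assumptions
made for this example; a real collation sequence would come from locale
data and cover far more than these four characters.

```python
# Hypothetical expansion table for the "sort ä as ae" strategy.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def phonebook_key(word):
    """Sort key that expands umlauts before comparing, ignoring case."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in word.lower())


words = ["Zahn", "Äpfel", "Apfel", "Affe"]

print(sorted(words))                     # ['Affe', 'Apfel', 'Zahn', 'Äpfel']
print(sorted(words, key=phonebook_key))  # ['Äpfel', 'Affe', 'Apfel', 'Zahn']
```

Plain comparison puts "Äpfel" after "Zahn" because Ä (U+00C4) has a higher
code point than Z; the expanding key sorts it as "aepfel".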
Info: The Java Collator class is configured with
- a locale
- a strength parameter, one of:
    IDENTICAL: all differences are significant
    PRIMARY (a vs b)
    SECONDARY (a vs ä)
    TERTIARY (a vs A)
- a decomposition (http://www.unicode.org/unicode/reports/tr15/)
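
The strength levels can be approximated in a few lines of Python 3. This
is a simplified sketch, not the Collator algorithm: it ignores locale,
only tests for equality, and relies on canonical decomposition (NFD) from
unicodedata to separate base letters from accents. The constants and the
function equal_at are hypothetical names.

```python
import unicodedata

PRIMARY, SECONDARY, TERTIARY = 1, 2, 3


def _base_letters(s):
    """Decompose s and drop combining marks, leaving the base letters."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_at(a, b, strength):
    """True if a and b are considered equal at the given strength."""
    if strength == PRIMARY:
        # Only base letters matter: accents and case are ignored.
        return _base_letters(a).lower() == _base_letters(b).lower()
    na = unicodedata.normalize("NFD", a)
    nb = unicodedata.normalize("NFD", b)
    if strength == SECONDARY:
        # Accents matter, case does not.
        return na.lower() == nb.lower()
    # TERTIARY: accents and case both matter.
    return na == nb


print(equal_at("a", "b", PRIMARY))    # False
print(equal_at("a", "ä", PRIMARY))    # True
print(equal_at("a", "ä", SECONDARY))  # False
print(equal_at("a", "A", SECONDARY))  # True
print(equal_at("a", "A", TERTIARY))   # False
```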