[Python-Dev] UTF-16 code point comparison

M.-A. Lemburg mal@lemburg.com
Thu, 27 Jul 2000 10:22:31 +0200

Finn Bock wrote:
> CPython's unicode compare function contains some code to compare surrogate
> characters in code-point order (I think). This is probably a very neat
> feature, but it differs from Java's way of comparing strings.
>   Python 2.0b1 (#0, Jul 26 2000, 21:29:11) [MSC 32 bit (Intel)] on win32
>   Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>   Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
>   >>> print u'\ue000' < u'\ud800'
>   1
>   >>> print ord(u'\ue000') < ord(u'\ud800')
>   0
>   >>>
> Java (and JPython) compares the 16-bit characters numerically, which results in:
>   JPython 1.1+08 on java1.3.0 (JIT: null)
>   Copyright (C) 1997-1999 Corporation for National Research Initiatives
>   >>> print u'\ue000' < u'\ud800'
>   0
>   >>> print ord(u'\ue000') < ord(u'\ud800')
>   0
>   >>>
> I don't think I can come up with a solution that allows JPython to emulate
> CPython on this type of comparison.

The code originally worked the same way Java does here.
Bill Tutt then added ideas from an IBM Java library
that turn the UTF-16 comparison into a true UCS-4 comparison.
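
To illustrate the difference, here is a minimal sketch in modern
Python 3 (not the actual C code in CPython). It assumes the usual
fixup trick for UTF-16: code units in the surrogate range are moved
above U+E000..U+FFFF before comparing, so that code unit order matches
code point order. The function names are made up for this example.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    # 'surrogatepass' lets lone surrogates such as '\ud800' be encoded.
    data = s.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit):
    """Remap a code unit so that unit order matches code point order."""
    if unit >= 0xE000:
        return unit - 0x0800   # E000..FFFF slide down to D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # surrogates D800..DFFF move up to F800..FFFF
    return unit


def _cmp(x, y):
    """Three-way comparison returning -1, 0 or 1."""
    return (x > y) - (x < y)


def compare_code_units(a, b):
    """Compare raw 16-bit units numerically (the Java behaviour described above)."""
    return _cmp(utf16_units(a), utf16_units(b))


def compare_code_points(a, b):
    """Compare in code point order by remapping each unit first."""
    return _cmp([fixup(u) for u in utf16_units(a)],
                [fixup(u) for u in utf16_units(b)])
```

With these definitions, compare_code_units('\ue000', '\ud800') gives 1
and compare_code_points('\ue000', '\ud800') gives -1, which mirrors the
two interpreter sessions quoted above.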

This really has nothing to do with being able to support
surrogates or not (as Fredrik mentioned). It is the correct
behaviour provided UTF-16 is used as the encoding for UCS-4 values
in Unicode literals, which is what Python currently does.

BTW, does Java support UCS-4? If not, then Java is wrong
here ;-)

Comparing Unicode strings is not as trivial as one might
think: private code point areas introduce a great many possibilities
of getting things wrong, and the fact that many characters
can be expressed by combining other characters adds to the
confusion. E.g. for sorting, we'd need full normalization
support for Unicode and would have to come up with some
smart strategy to handle private code point areas.
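
As a small illustration of the combining character problem, the
sketch below uses unicodedata.normalize() from the modern standard
library (it was added to Python well after this message, so treat it as
an illustration and not as something available in 2.0). The helper name
normalized_equal is made up for this example.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def normalized_equal(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

A plain comparison treats composed and decomposed as different strings,
while normalized_equal(composed, decomposed) treats them as equal. This
only covers canonical equivalence; it does nothing for private code
point areas or for locale-specific sort order.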

All this is highly non-trivial and will probably not get
implemented for a while (other issues are more important
right now, e.g. getting the internals to use the default
encoding instead of UTF-8).

For now I'd suggest leaving Bill's code activated because
it does the right thing for Python's Unicode implementation
(which is built upon UTF-16).

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/