[I18n-sig] Re: Unicode surrogates: just say no!
Guido van Rossum
guido@digicool.com
Tue, 26 Jun 2001 19:34:16 -0400
> 1) Sort order. Unicode strings should sort in Unicode lexicographical
> order. With UCS-4 this is easy; just compare the Py_UNICODE values
> one by one like C does with strcmp(). With UTF-16 this is more
> complicated when surrogates get involved. Basically, you go
> through the strings being compared until you find the first
> difference. If both characters at this point are in the BMP or
> both are high surrogates, just compare them as usual. However, if
> one is in the BMP and the other is a surrogate, you need to make
> sure that the string with the surrogate in it sorts after the one
> with the BMP character. Straight comparison won't work since there
> are characters in the BMP with numerical values greater than those
> of surrogates.
>
> I believe that this is the right thing to do when Py_UNICODE is
> UCS-2 since the added complexity is only O(1) per string comparison
> and is very easy to implement. This will ensure that
> cmp(unichr(0xFFFD), unichr(0x10ABCD)) will work consistently and
> correctly for both UCS-2 and UCS-4.
I'm neutral on this one; on the one hand I think we should minimize
the surrogate support outside the codecs, on the other hand this makes
some sense.
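
For illustration, the comparison described in point 1 can be sketched as follows. This is a minimal sketch in present-day Python 3, not the code of any actual Python build; the helper names utf16_units, fixup and compare_code_point_order are made up for this example. It uses a common remapping trick (swapping the surrogate range with U+E000..U+FFFF at the first differing unit), which is one possible way to get the ordering the quoted text asks for.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit):
    """Remap a code unit so that surrogates sort above all BMP characters."""
    if unit >= 0xE000:
        return unit - 0x800    # U+E000..U+FFFF moves down below the surrogates
    if unit >= 0xD800:
        return unit + 0x2000   # surrogates move up to the top of the 16-bit range
    return unit


def compare_code_point_order(a, b):
    """Compare two lists of UTF-16 code units; return -1, 0 or 1."""
    for x, y in zip(a, b):
        if x != y:
            # Only the first differing unit needs the remapping,
            # so the extra cost is O(1) per comparison.
            fx, fy = fixup(x), fixup(y)
            return (fx > fy) - (fx < fy)
    # One sequence is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


bmp = utf16_units("\uFFFD")
astral = utf16_units("\U0010ABCD")
naive_says_bmp_is_greater = bmp > astral               # plain unit-by-unit order
fixed_result = compare_code_point_order(bmp, astral)   # code point order
```

With the remapping, U+FFFD sorts before U+10ABCD, which matches what a UCS-4 comparison would give; the plain code-unit comparison puts them the other way round.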
> 2) There is an incompatibility between the two approaches since
> unichr(high surrogate) + unichr(low surrogate) will magically be
> the same as unichr(the appropriate astral codepoint) when UCS-2 is
> used. With UCS-4 they will not; it will result in a string with
> two values that have no well-defined meaning.
>
> I don't think this is a show-stopper, but people will need to be
> made aware.
Agreed.
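
The incompatibility in point 2 can be illustrated with the surrogate arithmetic itself. The sketch below assumes present-day Python 3, whose str type behaves like the UCS-4 case in the quoted text; to_surrogates and from_surrogates are hypothetical helper names, not functions from the standard library.

```python
def to_surrogates(cp):
    """Split an astral code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not an astral code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10ABCD)
pair = chr(high) + chr(low)   # two separate surrogate values
astral = chr(0x10ABCD)        # one astral character

# On a UCS-4 style string type the two are different strings, which is
# the incompatibility described above; a UTF-16 based build stores the
# same two code units for both.
same_string = (pair == astral)
```

Here same_string is False and len(pair) is 2, while from_surrogates(high, low) recovers 0x10ABCD, so the pairing only acquires a meaning once something (such as a codec) interprets the two values as UTF-16.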
--Guido van Rossum (home page: http://www.python.org/~guido/)