[I18n-sig] Re: Unicode surrogates: just say no!

Gaute B Strokkenes gs234@cam.ac.uk
27 Jun 2001 00:30:00 +0100

On 27 Jun 2001, gs234@cam.ac.uk wrote:
> 1) Sort order.  Unicode strings should sort in Unicode
>    lexicographical order.  With UCS-4 this is easy; just compare the
>    Py_UNICODE values one by one like C does with strcmp().  With
>    UTF-16 this is more complicated when surrogates get involved.
>    Basically, you go through the strings being compared until you
>    find the first difference.  If both characters at this point are
>    in the BMP or both are surrogates, just compare them as
>    usual.  However, if one is in the BMP and the other is a
>    surrogate, you need to make sure that the string with the
>    surrogate in it sorts after the one with the BMP character.
>    Straight comparison won't work since there are characters in the
>    BMP with numerical values greater than those of surrogates.
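
To make the rule above concrete, here is a minimal sketch in Python.  It
works on plain lists of 16-bit code units rather than on Py_UNICODE
buffers, and it assumes well-formed UTF-16 (no unpaired surrogates).
The names utf16_units, is_surrogate and compare_utf16_code_point_order
are made up for this illustration; they are not part of Python or of any
library.

```python
def utf16_units(text):
    """Return the UTF-16 code units of a Python string as a list of ints."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


def compare_utf16_code_point_order(a, b):
    """Compare two code unit sequences in code point order.

    Returns a negative number, zero, or a positive number, like strcmp().
    Assumes both inputs are well-formed UTF-16.
    """
    for x, y in zip(a, b):
        if x == y:
            continue
        # First difference found.
        if is_surrogate(x) != is_surrogate(y):
            # Exactly one side is a surrogate, so it belongs to a character
            # outside the BMP and must sort after the BMP character.
            return 1 if is_surrogate(x) else -1
        # Both in the BMP or both surrogates: compare as usual.
        return x - y
    # One sequence is a prefix of the other: the shorter one sorts first.
    return len(a) - len(b)
```

For example, U+FF5E is the single unit FF5E, while U+10000 is the pair
D800 DC00.  A straight unit-by-unit comparison puts U+FF5E after
U+10000, whereas the function above puts it before.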

Speak of the devil: mere seconds after I sent this, the following was
posted to the Unicode list:

On Tue, 26 Jun 2001, mark@macchiato.com wrote:
> I asked our performance czar to run a test comparing the performance
> of the two ICU utf-16 strcmp routines (UTF-16 binary order and
> UTF-8/32 binary order). While I want to caution that the results are
> preliminary, here they are:
> "Test File      u_strcmp      u_strcmpCodePointOrder
> ----------------------------------------------------
> Asian Names      81 ns / call       83 ns / call
> Latin Names     127 ns / call      124 ns / call
> The test is a binary search of a sorted list of roughly 10000 names.
> The Asian names are quite a bit shorter, which probably accounts for
> the time difference between them and the Latin names.
> The code path through the u_strcmpCodePointOrder function has
> (statistically, anyhow) exactly one added simple if relative to
> u_strcmp.  The timing differences are repeatable on my machine, but
> are probably mostly noise from code alignment and the like..."
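
One way a code point order comparison can get by with a single extra
test is to remap the two differing code units before subtracting them,
so that surrogates land above the rest of the BMP.  The sketch below
illustrates that idea in Python; I have not checked that ICU's
u_strcmpCodePointOrder is written this way, and the names fix_up and
compare_with_fix_up are hypothetical.  Well-formed UTF-16 is assumed.

```python
def fix_up(unit):
    """Remap a UTF-16 code unit so integer order matches code point order."""
    if unit >= 0xD800:
        # Surrogates D800..DFFF move up to F800..FFFF;
        # BMP code units E000..FFFF move down to D800..F7FF.
        unit += 0x2000 if unit <= 0xDFFF else -0x800
    return unit


def compare_with_fix_up(a, b):
    """Compare two lists of UTF-16 code units in code point order."""
    for x, y in zip(a, b):
        if x != y:
            # Only the first differing pair needs the remapping.
            return fix_up(x) - fix_up(y)
    # One sequence is a prefix of the other: the shorter one sorts first.
    return len(a) - len(b)
```

The loop itself is the same as a plain strcmp-style loop; the remapping
only runs once, at the first difference, which would fit with the small
timing gap reported above.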

Big Gaute                               http://www.srcf.ucam.org/~gs234/
How's it going in those MODULAR LOVE UNITS??