[I18n-sig] Re: Unicode surrogates: just say no!
Gaute B Strokkenes
gs234@cam.ac.uk
27 Jun 2001 00:30:00 +0100
On 27 Jun 2001, gs234@cam.ac.uk wrote:
>
> 1) Sort order. Unicode strings should sort in Unicode
> lexicographical order. With UCS-4 this is easy; just compare the
> Py_UNICODE values one by one like C does with strcmp(). With
> UTF-16 this is more complicated when surrogates get involved.
> Basically, you go through the strings being compared until you
> find the first difference. If both characters at this point are
> in the BMP or both are surrogates, just compare them as
> usual. However, if one is in the BMP and the other is a
> surrogate, you need to make sure that the string with the
> surrogate in it sorts after the one with the BMP character.
> Straight comparison won't work since there are characters in the
> BMP with numerical values greater than those of surrogates.
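
To make the rule quoted above concrete, here is a minimal sketch in Python. It assumes both strings are well-formed UTF-16 given as lists of 16-bit code units; the names is_surrogate, compare_code_point_order and utf16_units are made up for illustration and are not taken from the Python source.

```python
def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range."""
    return 0xD800 <= unit <= 0xDFFF


def compare_code_point_order(a, b):
    """Compare two sequences of UTF-16 code units in code point order.

    Returns a negative number, zero, or a positive number, like strcmp().
    Assumes well-formed UTF-16 (no unpaired surrogates).
    """
    for x, y in zip(a, b):
        if x != y:
            sx, sy = is_surrogate(x), is_surrogate(y)
            if sx != sy:
                # One side starts a supplementary character, the other is
                # an ordinary BMP character: the surrogate side sorts last.
                return 1 if sx else -1
            # Both ordinary BMP units, or both surrogates: compare as usual.
            return x - y
    # One string is a prefix of the other: the shorter one sorts first.
    return len(a) - len(b)


def utf16_units(text):
    """Helper for the demo: encode a str into a list of UTF-16 code units."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp = utf16_units("\uFF5E")          # a BMP character above the surrogate range
supp = utf16_units("\U00010000")     # first supplementary character

print(bmp > supp)                            # straight code unit comparison
print(compare_code_point_order(bmp, supp))   # code point order
```

The straight comparison puts U+FF5E after U+10000, because 0xFF5E is greater than the high surrogate 0xD800; the adjusted comparison puts it before, which is the order UCS-4 would give.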
Speak of the devil indeed: mere seconds after I sent this, the
following was posted to the unicode list:
On Tue, 26 Jun 2001, mark@macchiato.com wrote:
> I asked our performance czar to run a test comparing the performance
> of the two ICU utf-16 strcmp routines (UTF-16 binary order and
> UTF-8/32 binary order). While I want to caution that the results are
> preliminary, here they are:
>
> "Test File      u_strcmp    u_strcmpCodePointOrder
> ---------------------------------------------------
> Asian Names      81 ns       83 ns / call
> Latin Names     127 ns      124 ns
>
>
> The test is a binary search of a sorted list of roughly 10000 names.
> The Asian names are quite a bit shorter, which probably accounts for
> the time difference between them and the Latin names.
>
> The code path through the u_strcmpCodePointOrder function has
> (statistically, anyhow) exactly one added simple if relative to
> u_strcmp. The timing differences are repeatable on my machine, but
> are probably mostly noise from code alignment and the like..."
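
As an illustration of how a single extra conditional at the first differing position can be enough, here is a sketch in Python. It is not ICU's code; the function names and the remapping constant are choices made for this example, and the input is assumed to be well-formed UTF-16 held as lists of code units.

```python
from functools import cmp_to_key


def fix_up(unit):
    """Remap a code unit >= 0xD800 so that surrogates sort last.

    Surrogates (0xD800..0xDFFF) keep their value; units 0xE000..0xFFFF
    are shifted down by 0x2800, which places them below every surrogate
    while keeping their relative order.
    """
    return unit - 0x2800 if unit >= 0xE000 else unit


def strcmp_code_point_order(a, b):
    """strcmp()-style comparison of UTF-16 code units in code point order."""
    for x, y in zip(a, b):
        if x != y:
            # The one added test: only when both units are at or above
            # 0xD800 can code unit order differ from code point order.
            if x >= 0xD800 and y >= 0xD800:
                x, y = fix_up(x), fix_up(y)
            return x - y
    return len(a) - len(b)


def utf16_units(text):
    """Helper for the demo: encode a str into a list of UTF-16 code units."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


samples = ["a", "\uD7FF", "\uE000", "\uFFFD", "\U00010000",
           "\U0010FFFF", "a\U00010000", "a\uFFFD"]

by_units = sorted(
    samples,
    key=cmp_to_key(lambda s, t: strcmp_code_point_order(utf16_units(s),
                                                        utf16_units(t))),
)

# Python compares str objects by code point, so this should agree.
print(by_units == sorted(samples))
```

The extra branch is only taken when both differing units are 0xD800 or above, so text without surrogates or high BMP characters follows the same path as a plain comparison apart from that one test.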
--
Big Gaute http://www.srcf.ucam.org/~gs234/
How's it going in those MODULAR LOVE UNITS??