[I18n-sig] possible bug in my UCA implementation
jtauber at jtauber.com
Mon Feb 13 05:02:50 CET 2006
I discovered I wasn't converting to NFD first, but that doesn't solve
the problem; it only explains why expansions were used.
Even with NFD, I get the same result.
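For reference, here is a minimal sketch of the NFD step using Python's
standard unicodedata module. The helper name nfd_codepoints is only
illustrative and is not part of pyuca. It lists the code points that the
initial letters of the three test words decompose into, which can be
compared with the number of collation elements reported for each.

```python
import unicodedata


def nfd_codepoints(text):
    """Return the code points of text after canonical decomposition (NFD)."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


# Initial letters of the three test words, written as escapes so the
# example does not depend on how this message was encoded.
initials = {
    "\u1f85": "small alpha with dasia, oxia and ypogegrammeni",
    "\u1f0c": "capital alpha with psili and oxia",
    "\u1f00": "small alpha with psili",
}

for ch, name in initials.items():
    print(name, nfd_codepoints(ch))
```

The three letters decompose into 4, 3 and 2 code points respectively.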
On 30/01/2006, at 4:35 PM, James Tauber wrote:
> My Python Unicode Collation Algorithm implementation is giving
> unexpected results that could be because of:
> 1. a bug in my code
> 2. a bug in the DUCET
> 3. a difference of opinion between the way I think Ancient Greek
> should be collated and the way the DUCET collates it
> I'd like to get the opinion of some of you who are more familiar
> with UCA (and perhaps can try my example out on ICU).
> For the purposes of testing, say I'm trying to sort the three words:
> (1) ᾅδης
> (2) Ἄβελ
> (3) ἀββά
> In my view they should be sorted in the reverse of their current
> order, but my pyuca code sorts them in the order listed above.
> pyuca assigns the words the following sort keys:
> (1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a',
> '0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20',
> '0x0', '0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0',
> '0x3b1', '0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2']
> (2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0',
> '0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8',
> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313',
> '0x301', '0x3b2', '0x3b5', '0x3bb']
> (3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0',
> '0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2',
> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313',
> '0x3b2', '0x3b2', '0x3b1', '0x301']
> The problem is that ᾅ (the first character of (1)) expands to 4
> collation elements, Ἄ (the first character of (2)) to 3 and ἀ
> (the first character of (3)) to 2. Because all but the first of
> those elements have a zero primary weight, a word compares less
> just by virtue of its first letter having more collation elements.
> I don't even understand why these letters are being treated as
> expansions rather than simply taking advantage of the secondary and
> tertiary levels, but sure enough that is how the DUCET describes them.
> Am I missing something fundamental in the algorithm? Or is it
> possible the DUCET is wrong?
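One thing that may be worth checking, offered as a sketch rather than a
diagnosis: UTS #10 describes sort key formation as appending a weight at
a given level only when it is non-zero, whereas the keys quoted above
keep the zero primary weights. The snippet below builds keys both ways
from collation elements transcribed from those quoted keys. WORDS,
sort_key and order are hypothetical names, not pyuca's API, and the
identical level is left out for brevity.

```python
# Collation elements as (primary, secondary, tertiary) tuples, transcribed
# from the sort keys quoted above. The labels match the numbered words.
WORDS = {
    "(1)": [(0x124E, 0x20, 0x2), (0, 0x2A, 0x2), (0, 0x32, 0x2),
            (0, 0x97, 0x2), (0x1252, 0x20, 0x2), (0x1257, 0x20, 0x2),
            (0x126A, 0x20, 0x19)],
    "(2)": [(0x124E, 0x20, 0x8), (0, 0x22, 0x2), (0, 0x32, 0x2),
            (0x124F, 0x20, 0x2), (0x1253, 0x20, 0x2), (0x125C, 0x20, 0x2)],
    "(3)": [(0x124E, 0x20, 0x2), (0, 0x22, 0x2), (0x124F, 0x20, 0x2),
            (0x124F, 0x20, 0x2), (0x124E, 0x20, 0x2), (0, 0x32, 0x2)],
}


def sort_key(elements, skip_zero):
    """Build a three-level sort key, optionally dropping zero weights."""
    key = []
    for level in range(3):
        if level:
            key.append(0)  # level separator
        for element in elements:
            weight = element[level]
            if weight or not skip_zero:
                key.append(weight)
    return key


def order(skip_zero):
    """Return the word labels sorted by their sort keys."""
    return sorted(WORDS, key=lambda label: sort_key(WORDS[label], skip_zero))


print(order(skip_zero=False))  # zero weights kept, as in the quoted keys
print(order(skip_zero=True))   # zero weights dropped at each level
```

With the zeros kept, the order is (1), (2), (3), as reported above; with
them dropped, it becomes (3), (2), (1).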