[I18n-sig] possible bug in my UCA implementation
James Tauber
jtauber at jtauber.com
Mon Jan 30 09:35:28 CET 2006
My Python Unicode Collation Algorithm implementation is giving
unexpected results that could be because of:
1. a bug in my code
2. a bug in the DUCET
3. a difference of opinion between the way I think Ancient Greek
should be collated and the way the DUCET collates it
I'd like to get the opinion of some of you who are more familiar with
UCA (and perhaps can try my example out on ICU).
For the purposes of testing, say I'm trying to sort the three words:
(1) ᾅδης
(2) Ἄβελ
(3) ἀββά
In my view they should sort in the reverse of the order listed above,
but my pyuca code sorts them in exactly the order listed.
pyuca assigns the words the following sort keys:
(1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a',
'0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20', '0x0',
'0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0', '0x3b1',
'0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2']
(2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0',
'0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8', '0x2',
'0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313', '0x301',
'0x3b2', '0x3b5', '0x3bb']
(3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0',
'0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2', '0x2',
'0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313', '0x3b2',
'0x3b2', '0x3b1', '0x301']
The problem is that ᾅ (the first character of (1)) expands to 4
collation elements, Ἄ (the first character of (2)) to 3 and ἀ (the
first character of (3)) to 2. Because all but the first of those
elements are zero, the words compare less just by virtue of having
more collation elements.
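To make the comparison concrete, here is a minimal sketch that uses only
the primary-level weights copied from the three keys above (everything
before the first level separator). The names primary and strip_zeros are
hypothetical and are not part of pyuca. strip_zeros only shows how the
order would change if zero weights were left out of the key; whether the
algorithm intends that is part of what I am asking.

```python
# Primary-level weights copied from the sort keys listed above.
primary = {
    "ᾅδης": [0x124E, 0x0, 0x0, 0x0, 0x1252, 0x1257, 0x126A],
    "Ἄβελ": [0x124E, 0x0, 0x0, 0x124F, 0x1253, 0x125C],
    "ἀββά": [0x124E, 0x0, 0x124F, 0x124F, 0x124E, 0x0],
}


def strip_zeros(weights):
    """Hypothetical helper: drop zero weights before comparing."""
    return [w for w in weights if w != 0]


# Order when the zero weights stay in the key (what I am seeing now).
as_generated = sorted(primary, key=lambda word: primary[word])

# Order when zero weights are left out of the key (an assumption).
zeros_dropped = sorted(primary, key=lambda word: strip_zeros(primary[word]))

print(as_generated)
print(zeros_dropped)
```

With the zeros kept, the word whose first letter has the most collation
elements sorts first, because a 0 is compared against the primary weight
of the following letter in the other words.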
I don't even understand why these letters are being treated as
expansions rather than simply taking advantage of the secondary and
tertiary levels, but sure enough that is how the DUCET describes them.
Am I missing something fundamental in the algorithm? Or is it
possible the DUCET is wrong?
James