[I18n-sig] possible bug in my UCA implementation

Mon Feb 13 05:44:49 CET 2006

Okay, I've fixed the problem. It did turn out to be a bug in my  
implementation.

The core of the algorithm was implemented as

	for level in range(4):
		if level:
                 	sort_key.append(0) # level separator
		for element in collation_elements:
			ce_l = int(element[1][level], 16)
			sort_key.append(ce_l)

instead of

	for level in range(4):
		if level:
                 	sort_key.append(0) # level separator
		for element in collation_elements:
			ce_l = int(element[1][level], 16)
			if ce_l:
				sort_key.append(ce_l)

In  other words, the UCA itself makes it clear (Step S3.5) that only  
non-zero CE_L should be appended.

I'll update the code on my website shortly.

James

On 13/02/2006, at 12:02 PM, James Tauber wrote:

>
> I discovered I wasn't converting to NFD first but this doesn't solve
> the problem, it just explains why expansions were used.
>
> Even using NFD, I get the same result.
>
> James
>
> On 30/01/2006, at 4:35 PM, James Tauber wrote:
>
>>
>> My Python Unicode Collation Algorithm implementation is giving
>> unexpected results that could be because of:
>>
>> 1. a bug in my code
>> 2. a bug in the DUCET
>> 3. a difference of opinion between the way I think Ancient Greek
>> should be collated and the way DUCET thinks so
>>
>> I'd like to get the opinion of some of you who are more familiar
>> with UCA (and perhaps can try my example out on ICU)
>>
>> For the purposes of testing, say I'm trying to sort the three words:
>>
>> (1)	ᾅδης
>> (2)	Ἄβελ
>> (3)	ἀββά
>>
>> In my view they should be sorted in the reverse to what they are
>> now, but my pyuca code sorts them in the order listed above.
>>
>> pyuca assigns the words the following sort keys:
>>
>> (1) ['0x124e', '0x0', '0x0', '0x0', '0x1252', '0x1257', '0x126a',
>> '0x0', '0x20', '0x2a', '0x32', '0x97', '0x20', '0x20', '0x20',
>> '0x0', '0x2', '0x2', '0x2', '0x2', '0x2', '0x2', '0x19', '0x0',
>> '0x3b1', '0x314', '0x301', '0x345', '0x3b4', '0x3b7', '0x3c2']
>> (2) ['0x124e', '0x0', '0x0', '0x124f', '0x1253', '0x125c', '0x0',
>> '0x20', '0x22', '0x32', '0x20', '0x20', '0x20', '0x0', '0x8',
>> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x391', '0x313',
>> '0x301', '0x3b2', '0x3b5', '0x3bb']
>> (3) ['0x124e', '0x0', '0x124f', '0x124f', '0x124e', '0x0', '0x0',
>> '0x20', '0x22', '0x20', '0x20', '0x20', '0x32', '0x0', '0x2',
>> '0x2', '0x2', '0x2', '0x2', '0x2', '0x0', '0x3b1', '0x313',
>> '0x3b2', '0x3b2', '0x3b1', '0x301']
>>
>> The problem is that ᾅ (the first character of (1)) expands to 4
>> collation elements, Ἄ (the first character of (2)) to 3 and ἀ
>> (the first character of (3)) to 2 and as a result and, because all
>> but the first element is zero, they are comparing less, just by
>> virtue of having more collation elements.
>>
>> I don't even understand why these letters are being treated as
>> expansions rather than simply taking advantage of the secondary and
>> tertiary levels, but sure enough that is how the DUCET describes  
>> them.
>>
>> Am I missing something fundamental in the algorithm? Or is it
>> possible the DUCET is wrong?
>>
>> James
>
> _______________________________________________
> I18n-sig mailing list
> I18n-sig at python.org
> http://mail.python.org/mailman/listinfo/i18n-sig

--
James Tauber                       http://jtauber.com/
journeyman of some   http://jtauber.com/blog/