locale.py strangeness

Antti Kaihola akaihola at ambi-no-spam-to-tone.com
Mon Jul 28 06:41:11 EDT 2003


Sun, 08 Jun 2003 16:44:25 +0200, Torsten Marek wrote:
> I experienced some strange behaviour with locale.py from Python 2.2.3.

So did I.

When using an LC_COLLATE locale (I've tested fi_FI and en_US),
locale.strcoll returns surprising (non-zero!) values when comparing
accented characters to the corresponding un-accented ones.  Here's the
result of my test:

locale:       None     fi_FI     en_US needed
encoding:     None ISO8859-1 ISO8859-1
strcoll(é, A)    1         4         4      4
strcoll(é, e)    1     [*] 1     [*] 3      0
strcoll(é, é)    0         0         0      0
strcoll(é, ö)   -1       -24       -10    -10
strcoll(é, o)    1       -10       -10    -10
strcoll(é, z)    1       -21       -21    -21
strcoll(ö, A)    1        28        14     14
strcoll(ö, e)    1        24        10     10
strcoll(ö, é)    1        24        10     10
strcoll(ö, ö)    0         0         0      0
strcoll(ö, o)    1        14     [*] 9      0 
strcoll(ö, z)    1         3       -11    -11

I've marked the strange lines with [*].  In the en_US locale,
why doesn't strcoll return zero when comparing accented characters
to the corresponding un-accented ones?  The fi_FI locale should also
do that for "e acute".

The first column shows strcoll results before I touch the LC_COLLATE
locale.  The next two columns show the results with the Finnish and
US English locales, and the last column is what I need for my
application.
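
A table like the one above can be regenerated with a few lines. This is
only a rough sketch in current Python 3 (not the 2.2.3 discussed here);
the helper name strcoll_table is made up for this illustration, and
locales that are not installed on the machine are silently skipped:

```python
import locale

def strcoll_table(pairs, locale_names):
    """Map each usable locale name to the strcoll results for all pairs."""
    # Calling setlocale with no second argument only queries the setting.
    saved = locale.setlocale(locale.LC_COLLATE)
    table = {}
    try:
        for name in locale_names:
            try:
                locale.setlocale(locale.LC_COLLATE, name)
            except locale.Error:
                continue  # this locale is not available here
            table[name] = [locale.strcoll(a, b) for a, b in pairs]
    finally:
        # Put the collation locale back the way we found it.
        locale.setlocale(locale.LC_COLLATE, saved)
    return table

if __name__ == "__main__":
    pairs = [("é", "e"), ("é", "é"), ("ö", "o"), ("ö", "z")]
    names = ["C", "fi_FI.ISO8859-1", "en_US.ISO8859-1"]
    for name, values in strcoll_table(pairs, names).items():
        print(name, values)
```

The actual numbers depend on the platform's C library, so they will not
necessarily match the ones I got.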

Note that an accented character and its un-accented counterpart are at
the same distance from any third character (e.g. strcoll(ö, e) and
strcoll(ö, é) in the table), yet they are not at distance zero from
each other.  So, amazingly, the distances don't add up:
>>> assert strcoll('X', 'i') - strcoll('X', 'j') == strcoll('j', 'i')
>>> assert strcoll('X', 'e') - strcoll('X', 'é') == strcoll('é', 'e')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AssertionError
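
One caveat that may explain part of this: the C standard defines only the
sign of strcoll's result (negative, zero or positive), not its magnitude,
so arithmetic on the returned values is not guaranteed to mean anything.
A minimal sketch in current Python 3 that reduces the result to its sign;
sign and collation_order are names made up for this illustration:

```python
import locale

def sign(n):
    """Return -1, 0 or 1 according to the sign of n."""
    return (n > 0) - (n < 0)

def collation_order(a, b):
    """Compare a and b under the current LC_COLLATE, keeping only the sign."""
    return sign(locale.strcoll(a, b))
```

Comparing signs instead of raw values at least makes results from
different locales and platforms comparable with each other.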

An accented character is also at the same distance from its un-accented
cousin as some unrelated un-accented character is:
>>> from locale import * ; setlocale(LC_COLLATE, 'en_US')
'en_US'
>>> strcoll('é', 'e'), strcoll('h', 'e')
(3, 3)
>>> strcoll('x', 'o'), strcoll('ö', 'o')
(9, 9)
>>> setlocale(LC_COLLATE, 'fi_FI')
'fi_FI'
>>> strcoll('f', 'e'), strcoll('é', 'e')
(1, 1)


See http://akaihola.iki.fi/comp/python/strcoll for the code, including a
work-around based on earlier discussions I've found on this newsgroup.
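
Independently of the code at that URL, the behaviour in the "needed"
column (accented characters comparing equal to their base letters) can
be approximated in current Python 3 by stripping combining marks before
comparing.  This is only a sketch under that assumption, it is not the
work-around mentioned above, and fold_accents and accent_blind_cmp are
hypothetical names:

```python
import unicodedata

def fold_accents(text):
    """Drop combining marks, so that 'é' becomes 'e' and 'ö' becomes 'o'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_blind_cmp(a, b):
    """Return -1, 0 or 1, comparing a and b with accents ignored."""
    ka, kb = fold_accents(a), fold_accents(b)
    return (ka > kb) - (ka < kb)
```

This compares the folded strings by code point, so it ignores the locale
entirely; the folded strings could be passed to locale.strcoll instead if
locale-aware ordering of the base letters is wanted.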




