locale.py strangeness
Antti Kaihola
akaihola at ambi-no-spam-to-tone.com
Mon Jul 28 06:41:11 EDT 2003
Sun, 08 Jun 2003 16:44:25 +0200, Torsten Marek wrote:
> I experienced some strange behaviour with locale.py from Python 2.2.3.
So did I.
When using an LC_COLLATE locale (I've tested fi_FI and en_US),
locale.strcoll returns surprising (non-zero!) values when comparing
accented characters to the corresponding un-accented ones. Here's the
result of my test:
                 None      fi_FI      en_US     needed
                 None  ISO8859-1  ISO8859-1
strcoll(é, A)       1          4          4          4
strcoll(é, e)       1      [*] 1      [*] 3          0
strcoll(é, é)       0          0          0          0
strcoll(é, ö)      -1        -24        -10        -10
strcoll(é, o)       1        -10        -10        -10
strcoll(é, z)       1        -21        -21        -21
strcoll(ö, A)       1         28         14         14
strcoll(ö, e)       1         24         10         10
strcoll(ö, é)       1         24         10         10
strcoll(ö, ö)       0          0          0          0
strcoll(ö, o)       1         14      [*] 9          0
strcoll(ö, z)       1          3        -11        -11
I've marked the strange lines with [*]. In the en_US locale,
why doesn't strcoll return zero when comparing accented characters
to the corresponding un-accented ones? The fi_FI locale should also
do that for "e acute".
The first column shows strcoll results before I touch the LC_COLLATE
locale. The next two columns show the results with the Finnish and
US English locales, and the last column is what I need for my
application.
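A table like this one could be regenerated with something along the following lines. This is only a sketch for a current Python 3 rather than 2.2.3; the helper name strcoll_table is made up for illustration, and locale names such as 'fi_FI.ISO8859-1' are assumptions that depend on what is installed on the system (a locale that cannot be set is reported as None).

```python
import locale

def strcoll_table(pairs, locale_names):
    """Return {locale_name: [strcoll(a, b) for each pair]}.

    A locale that cannot be set maps to None instead of a list.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    table = {}
    try:
        for name in locale_names:
            try:
                locale.setlocale(locale.LC_COLLATE, name)
            except locale.Error:
                table[name] = None  # locale not available on this system
                continue
            table[name] = [locale.strcoll(a, b) for a, b in pairs]
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting
    return table

if __name__ == "__main__":
    pairs = [("é", "A"), ("é", "e"), ("é", "é"), ("ö", "o"), ("ö", "z")]
    names = ["C", "fi_FI.ISO8859-1", "en_US.ISO8859-1"]  # assumed locale names
    for name, row in strcoll_table(pairs, names).items():
        print(name, row)
```

The actual numbers will depend on the platform's C library and locale data, so they need not match the table above.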
Note that an accented character and its un-accented counterpart are at the
same distance from another character (e.g. ö-e and ö-é in the table), even
though strcoll reports a non-zero distance between the two themselves.
So, amazingly, the distances don't add up:
>>> assert strcoll('X', 'i') - strcoll('X', 'j') == strcoll('j', 'i')
>>> assert strcoll('X', 'e') - strcoll('X', 'é') == strcoll('é', 'e')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AssertionError
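One caveat about doing arithmetic on these return values: as far as the documentation goes, locale.strcoll (like the C strcoll underneath it) only promises a negative, zero, or positive result, so the magnitudes are not guaranteed to behave like distances. A minimal sketch of a comparison that relies on the sign alone, written for a current Python 3; sign and collate_sign are hypothetical helper names, not part of the locale module:

```python
import locale

def sign(n):
    """Collapse an integer to -1, 0 or 1."""
    return (n > 0) - (n < 0)

def collate_sign(a, b):
    """Sign of locale.strcoll(a, b) under the current LC_COLLATE setting."""
    return sign(locale.strcoll(a, b))
```

This does not make strcoll('é', 'e') zero; it only removes the dependence on how large the non-zero values happen to be.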
An accented character is also the same distance from its un-accented cousin
as some other, unrelated un-accented character is:
>>> from locale import * ; setlocale(LC_COLLATE, 'en_US')
'en_US'
>>> strcoll('é', 'e'), strcoll('h', 'e')
(3, 3)
>>> strcoll('x', 'o'), strcoll('ö', 'o')
(9, 9)
>>> setlocale(LC_COLLATE, 'fi_FI')
'fi_FI'
>>> strcoll('f', 'e'), strcoll('é', 'e')
(1, 1)
See http://akaihola.iki.fi/comp/python/strcoll for the code, including a
work-around based on earlier discussions I've found on this newsgroup.
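
For readers without access to that page, here is a minimal sketch of one possible work-around. It is not the code at the URL above: it assumes a current Python 3 with Unicode strings, and the names strip_accents and accentless_strcoll are invented for this example. The idea is to strip accents first and hand only the un-accented text to strcoll, so that é compares equal to e and ö to o, as in the "needed" column.

```python
import locale
import unicodedata

def strip_accents(text):
    """Drop combining marks after canonical decomposition (é -> e, ö -> o)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accentless_strcoll(a, b):
    """Like locale.strcoll, but ignoring accents on both arguments."""
    return locale.strcoll(strip_accents(a), strip_accents(b))
```

Only the sign of the result should be relied on. Note also that this folds every accent, which is what the "needed" column asks for but not what Finnish collation does, where ö is a letter of its own.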