[Pythonmac-SIG] Accented characters

Quarante-Deux xlii at xlii.org
Fri Jul 16 19:01:18 CEST 2004


At 22:07 +0200 13.07.2004, Jack Jansen wrote:
>I knew there was a solution to this, but it wasn't easy to find:-)
>
>But I did find it: see
><http://www.python.org/peps/pep-0263.html> for
>an explanation on how to state the encoding of
>your sourcefile. And the easiest way to do it is
>to save your file as UTF-8 and with BOM marks. I
>think BBEdit has an option to save files in this
>format.

Thank you.

I can now get regular expressions to recognize word boundaries correctly.
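
For readers on a current Python 3, here is a minimal sketch of the word-boundary behaviour. It assumes Python 3, where str patterns are Unicode-aware by default; this is not the 2004 setup discussed in this thread, and the sample text is made up for illustration.

```python
# -*- coding: utf-8 -*-
# The line above is a PEP 263 encoding declaration; in Python 3 it is
# optional because UTF-8 is already the default source encoding.
import re

text = "l'été à Paris"  # hypothetical sample text

# With a str pattern, \w and \b treat accented letters as word characters.
words = re.findall(r"\b\w+\b", text)

# With a bytes pattern, only ASCII letters count as word characters,
# so the UTF-8 bytes of the accented letters split the words apart.
byte_words = re.findall(rb"\b\w+\b", text.encode("utf-8"))

print(words)
print(byte_words)
```

The contrast between the two results shows why the encoding has to be known before word boundaries can come out right.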

However, sorting is another problem. I cannot
get accented characters to sort correctly using
locale.strcoll. By "correctly" I mean that the
accented character either sorts directly after
the non-accented one, or is considered equal to the
non-accented one. All I can get with
fr_FR.ISO8859-15 is that all the accented
characters sort before all the other characters.
If I do a plain sort (no locale), or use
fr_FR.UTF-8, the accented characters sort after
all the others. Neither is of any use to me.

Here's a sample output to a file (the source has
a list with é (e acute) and à (a grave)):

oldloc: C
newloc: fr_FR.ISO8859-15

before sorting:
['a', 'w', 'f', '\xc3\xa9', 'b', '\xc3\xa0', 'd']
['a', 'z', 'f', '\xc3\xa9', 'b', '\xc3\xa0', 'd']

after sorting with locale.strcoll:
['\xc3\xa0', '\xc3\xa9', 'a', 'b', 'd', 'f', 'w']
after sorting without locale.strcoll:
['a', 'b', 'd', 'f', 'z', '\xc3\xa0', '\xc3\xa9']

oldloc: C
newloc: fr_FR.UTF-8

before sorting:
['a', 'w', 'f', '\xc3\xa9', 'b', '\xc3\xa0', 'd']
['a', 'z', 'f', '\xc3\xa9', 'b', '\xc3\xa0', 'd']

after sorting with locale.strcoll:
['a', 'b', 'd', 'f', 'w', '\xc3\xa0', '\xc3\xa9']
after sorting without locale.strcoll:
['a', 'b', 'd', 'f', 'z', '\xc3\xa0', '\xc3\xa9']
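
A comparison like the one above could be reproduced roughly as follows. This is only a sketch, assuming Python 3 and Unicode str values instead of the UTF-8 byte strings shown in the output. The helper name sort_with_locale is hypothetical, and the result depends entirely on the platform's C library and on which locales are installed.

```python
import locale
from functools import cmp_to_key

def sort_with_locale(words, locale_name):
    """Sort words with locale.strcoll under locale_name.

    Returns None if the locale is not available on this system.
    (Hypothetical helper, for illustration only.)
    """
    old = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, old)  # restore the old setting

words = ['a', 'w', 'f', 'é', 'b', 'à', 'd']

plain = sorted(words)  # plain sort: compares code points
result = sort_with_locale(words, 'fr_FR.UTF-8')

print("plain: ", plain)
print("locale:", result)
```

Running the same script on different systems makes it easy to see whether the collation order comes from Python or from the underlying C library.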

The problem seems to be in the FreeBSD
implementation of the C library's LC_COLLATE,
because the problem is identical on Mac OS X and on
FreeBSD 5.2.1. On Linux, however, the sorting order is
good.

I can't see any reason for this, but I found this:
http://akaihola.iki.fi/comp/python/strcoll

Apparently, I'm not the only one with sorting
problems. I figure I'll have to roll my own :-)
unless there's something I missed somewhere.
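
One possible shape for a home-made sort, given only as a hedged sketch: it assumes Python 3 and Unicode strings, and the function name accent_key is hypothetical. It strips accents to build the primary key, so an accented letter is compared as if it were the plain letter, and uses the original word as a tie-breaker so the accented form sorts directly after the plain one. It does not implement full French collation rules.

```python
import unicodedata

def accent_key(word):
    """Sort key: accent-stripped, lowercased form first, original word second.

    (Hypothetical helper; an approximation, not a full collation algorithm.)
    """
    # NFD splits 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize('NFD', word)
    # Drop the combining marks to get the unaccented base letters.
    base = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return (base.lower(), word)

words = ['a', 'w', 'f', 'é', 'b', 'à', 'd']
sorted_words = sorted(words, key=accent_key)
print(sorted_words)
```

Because the key is computed in Python, the order does not depend on the platform's LC_COLLATE tables.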

Ellen

-- 
-------------------------------------------------------------------
xlii at xlii.org                          |  Ellen C. Herzfeld
http://www.quarante-deux.org/          |  Dominique O. Martel
Quelques pages sur la Science-Fiction  |  Quarante-Deux
-------------------------------------------------------------------

