localizing a sort

Peter Otten __peter__ at web.de
Sun Sep 2 05:18:00 EDT 2007


On Sat, 01 Sep 2007 18:56:38 -0300, Ricardo Aráoz wrote:

> Hi, I've been working on sorting out some words.
> 
> My locale is :
>>>> import locale
>>>> locale.getdefaultlocale()
> ('es_AR', 'cp1252')
> 
> I do :
>>>> a = 'áéíóúäëïöüàèìòù'
>>>> print ''.join(sorted(a, cmp=lambda x,y: locale.strcoll(x,y)))
> aeiouàáäèéëìíïòóöùúü

The lambda is superfluous. Just write cmp=locale.strcoll instead.
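
A minimal sketch of that equivalence, assuming Python 3 (the session quoted
above is Python 2). Python 3's sorted() has no cmp argument, so
functools.cmp_to_key wraps the comparison function here. The word list is
made up for illustration, and the fallback to the "C" locale is only there
so the sketch runs on machines without a usable default locale.

```python
import functools
import locale

# Use the user's default collation if available; fall back to "C" otherwise.
try:
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    locale.setlocale(locale.LC_COLLATE, "C")

words = ["pblabra", "de", "palabra", "de"]  # illustrative data only

# With the redundant lambda, as in the quoted code.
with_lambda = sorted(
    words, key=functools.cmp_to_key(lambda x, y: locale.strcoll(x, y))
)

# Passing the function itself does the same job with one call layer less.
direct = sorted(words, key=functools.cmp_to_key(locale.strcoll))

print(with_lambda == direct)  # True: both use the same comparison
```

The lambda only forwards its two arguments unchanged, so dropping it changes
nothing except the extra function call per comparison.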
 
> This is not what I am expecting. I was expecting :
> aáàäeéèëiíìï.....etc.
> 
> The reason is that if you want to order some words (say for a dictionary
> (paper dict, where you look up words)) this is what happens :
>>>> a = 'palàbra de pàlabra de pblabra'
>>>> print ' '.join(sorted(a.split(), cmp=lambda x,y: locale.strcoll(x, y)))
> de de palàbra pblabra pàlabra
> 
> While any human being would expect :
> 
> de de palàbra pàlabra pblabra
> 
> Does anybody know a way in which I could get the desired output?

I suppose it would work on your machine if you set the locale first with

>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'

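On my machine I have to use a list of one-character strings instead of a
single string, because my locale uses the UTF-8 encoding, where one character
may consist of more than one byte, and iterating over a str would split such
characters into separate bytes.
(Providing key is more efficient than cmp, since the key is computed only
once per item.)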

>>> a = ['á', 'é', 'í', 'ó', 'ú', 'ä', 'ë', 'ï', 'ö', 'ü', 'à', 'è', 'ì', 'ò', 'ù', 'a', 'e', 'i', 'o', 'u']
>>> print "".join(sorted(a, key=locale.strxfrm))
aáàäeéèëiíìïoóòöuúùü
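
The multi-byte issue can be made visible with a short sketch. It is written
in Python 3 syntax (an assumption; the session above is Python 2), where
encoding to bytes has to be done explicitly. The two-character sample string
is chosen only for illustration.

```python
# "áé", written with escapes so the source encoding does not matter.
text = "\u00e1\u00e9"
encoded = text.encode("utf-8")

print(len(text))     # 2 characters
print(len(encoded))  # 4 bytes: each character takes two bytes in UTF-8

# Iterating over the encoded data yields fragments of characters, which is
# what happens when a UTF-8 byte string is sorted element by element.
fragments = [encoded[i:i + 1] for i in range(len(encoded))]
print(fragments)  # [b'\xc3', b'\xa1', b'\xc3', b'\xa9']

# In cp1252 each of these characters is a single byte, so the same
# approach happens to work there.
print(len(text.encode("cp1252")))  # 2
```

Sorting the fragments individually would reorder bytes that belong together,
which is why the list above holds whole characters.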

However, to make your program a bit more portable I recommend that you 
use unicode instead of str:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> encoding = locale.getlocale()[1]
>>> def sortkey(s):
...     return locale.strxfrm(s.encode(encoding))
... 
>>> print "".join(sorted(u"áéíóúäëïöüàèìòùaeiou", key=sortkey))
aáàäeéèëiíìïoóòöuúùü
>>> 
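
For comparison, a sketch of the same idea assuming Python 3, which is not
what the session above runs. There str is already Unicode and
locale.strxfrm accepts it directly, so the encode step in sortkey is not
needed. The words are the ones from the question; the fallback to the "C"
locale is an addition so the sketch runs without a configured locale. No
output is shown because the resulting order depends on the locale installed
on the machine.

```python
import locale

# Pick up the user's default locale; fall back to "C" if it is unusable.
try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    locale.setlocale(locale.LC_ALL, "C")

words = "palàbra de pàlabra de pblabra".split()

# strxfrm turns each string into a key whose ordinary comparison matches
# the collation rules of the current locale.
result = sorted(words, key=locale.strxfrm)
print(" ".join(result))
```

With a Spanish or German UTF-8 locale this should give the dictionary-style
order asked for, while the "C" locale falls back to plain code point order.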

Peter


