case-insensitive and internationalized sort

Thu Dec 19 16:26:08 EST 2002

> The fact that the built-in list sort method is case-sensitive seems to be a
> recurring topic. If a named parameter is going to be added to the sort
> method, it would probably require a PEP and discussion on python-dev before
> it was accepted. But since sort() doesn't do what a lot of people expect I
> would like to discuss the issues here first.

I disagree that sorting a list of strings doesn't do what most people 
expect; I also disagree that sorting in a case-insensitive manner would
be the right solution.

Sorting L produces an order such that, for every n, L[n] <= L[n+1]. I
think this is what most people expect when they talk about sorting.

So it is the comparison that you are unhappy with, not the sorting.
While the definition of string comparison may not be intuitive, the
lexicographical ordering that Python implements is really the only
sensible thing.

>   s.sort(lambda a, b: cmp(a.upper(), b.upper()))
> 
> Having a separate function may be easier to read, but are there speed
> differences or other trade-offs?

There won't be any difference between a a lambda function and a proper
function. Notice, however, that this computes L[n].upper() quite often,
so you may want to avoid the cost of uppercasing every string over and
over again. Why not make a list of pairs

s = [(x.upper(), x) for x in s]

It turns out that the builtin comparison of such tuples does what you
want:
   cmp ((x1upper, x1),(x2upper,x2)) is cmp(x1upper,x2upper) if this
   is nonzero, and is cmp(x1,x2) if both upper-case strings compare
   equal

So this will sort the list in a way that is case insensitive; if two
strings differ only in case, it will sort the strings in a 
case-sensitive manner.

> Now the next point is that it would be nice to be able to get a
> case-insensitive sort, which seems to be the most likely thing you want to
> do when sorting strings. If an optional casesensitive arg was added to
> sort() then without breaking any old code you could do:
>
> s.sort(casesensitive=False)

This won't fly. What if you are not sorting a list of strings, but, say,
a list of numbers?

> The final point is that the solution above still doesn't handle the
> characters a umlaut and A umlaut correctly. 

Define "correctly". Different languages sort accented characters in
different ways: either put them after all other letters (Swedish, I 
believe, and French for their accented characters), or sort them 
immediately after the unaccented form (German DIN sorting), or
sort them as-if transliterated (ie. sort Löwis as Loewis, old German 
phonebook sorting). Since we have two sorting orders in Germany for 
accented characters, I always forget how they work, and have to look in 
the dictionary to see what sorting procedure it uses.

In any case, if you want to use locale-aware sorting, you need to use 
the locale.strcoll function (use setlocale before using that function). 
Since strcoll is expensive, you can use locale.strxfrm on each string
to transform it into a byte string for which lexicographical sorting
honors collation order.

> Anyone up for a casesensitive flag addition to the sort() method?

Please, no.

Notice that strcoll and strxfrm aren't the most recent approach to
string collation: If you use Unicode and the IBM ICU library, they'll
offer extensive collation support where you can chose between various 
collation conventions at run-time.

Regards,
Martin