case-insensitive and internationalized sort
"Martin v. Löwis"
martin at v.loewis.de
Thu Dec 19 16:26:08 EST 2002
> The fact that the built-in list sort method is case-sensitive seems to be a
> recurring topic. If a named parameter is going to be added to the sort
> method, it would probably require a PEP and discussion on python-dev before
> it was accepted. But since sort() doesn't do what a lot of people expect I
> would like to discuss the issues here first.
I disagree that sorting a list of strings doesn't do what most people
expect; I also disagree that sorting in a case-insensitive manner would
be the right solution.
Sorting L produces an order such that, for every n, L[n] <= L[n+1]. I
think this is what most people expect when they talk about sorting.
So it is the comparison that you are unhappy with, not the sorting.
While the definition of string comparison may not be intuitive, the
lexicographical ordering that Python implements is really the only
sensible thing.
> s.sort(lambda a, b: cmp(a.upper(), b.upper()))
>
> Having a separate function may be easier to read, but are there speed
> differences or other trade-offs?
There won't be any difference between a a lambda function and a proper
function. Notice, however, that this computes L[n].upper() quite often,
so you may want to avoid the cost of uppercasing every string over and
over again. Why not make a list of pairs
s = [(x.upper(), x) for x in s]
It turns out that the builtin comparison of such tuples does what you
want:
cmp ((x1upper, x1),(x2upper,x2)) is cmp(x1upper,x2upper) if this
is nonzero, and is cmp(x1,x2) if both upper-case strings compare
equal
So this will sort the list in a way that is case insensitive; if two
strings differ only in case, it will sort the strings in a
case-sensitive manner.
> Now the next point is that it would be nice to be able to get a
> case-insensitive sort, which seems to be the most likely thing you want to
> do when sorting strings. If an optional casesensitive arg was added to
> sort() then without breaking any old code you could do:
>
> s.sort(casesensitive=False)
This won't fly. What if you are not sorting a list of strings, but, say,
a list of numbers?
> The final point is that the solution above still doesn't handle the
> characters a umlaut and A umlaut correctly.
Define "correctly". Different languages sort accented characters in
different ways: either put them after all other letters (Swedish, I
believe, and French for their accented characters), or sort them
immediately after the unaccented form (German DIN sorting), or
sort them as-if transliterated (ie. sort Löwis as Loewis, old German
phonebook sorting). Since we have two sorting orders in Germany for
accented characters, I always forget how they work, and have to look in
the dictionary to see what sorting procedure it uses.
In any case, if you want to use locale-aware sorting, you need to use
the locale.strcoll function (use setlocale before using that function).
Since strcoll is expensive, you can use locale.strxfrm on each string
to transform it into a byte string for which lexicographical sorting
honors collation order.
> Anyone up for a casesensitive flag addition to the sort() method?
Please, no.
Notice that strcoll and strxfrm aren't the most recent approach to
string collation: If you use Unicode and the IBM ICU library, they'll
offer extensive collation support where you can chose between various
collation conventions at run-time.
Regards,
Martin
More information about the Python-list
mailing list