case-insensitive and internationalized sort

Thu Dec 19 20:10:23 EST 2002

The case-insensitive sort code below was added to the Python Cookbook:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/170242

ka

"Kevin Altis" <altis at semi-retired.com> wrote in message
news:MGrM9.826$tO3.92174 at news.uswest.net...
> "Martin v. Löwis" <martin at v.loewis.de> wrote in message
> news:attdhe$bot$05$1 at news.t-online.com...
> > > The fact that the built-in list sort method is case-sensitive seems to
> be a
> > > recurring topic. If a named parameter is going to be added to the sort
> > > method, it would probably require a PEP and discussion on python-dev
> before
> > > it was accepted. But since sort() doesn't do what a lot of people
expect
> I
> > > would like to discuss the issues here first.
> >
> > I disagree that sorting a list of strings doesn't do what most people
> > expect; I also disagree that sorting in a case-insensitive manner would
> > be the right solution.
> >
> > Sorting L produces an order such that, for every n, L[n] <= L[n+1]. I
> > think this is what most people expect when they talk about sorting.
> >
> > So it is the comparison that you are unhappy with, not the sorting.
> > While the definition of string comparison may not be intuitive, the
> > lexicographical ordering that Python implements is really the only
> > sensible thing.
>
> Correct, it is the compare that I take issue with. I've been looking
through
> my code and I can't find any instances where when comparing strings I was
> interested in an ordinal value comparison. Whether it is filenames or
lists
> in HTML, lists of strings for a UI, etc. ignoring case is what the end
user
> expects to see when dealing with strings. Sorting strings with the
default:
>
> >>> s = ['c', 'C', 'b', 'a', 'A', 'B']
> >>> s.sort()
> >>> s
> ['A', 'B', 'C', 'a', 'b', 'c']
>
> is just so 1970s. ;-)
>
> > >   s.sort(lambda a, b: cmp(a.upper(), b.upper()))
> > >
> > > Having a separate function may be easier to read, but are there speed
> > > differences or other trade-offs?
> >
> > There won't be any difference between a a lambda function and a proper
> > function. Notice, however, that this computes L[n].upper() quite often,
> > so you may want to avoid the cost of uppercasing every string over and
> > over again. Why not make a list of pairs
> >
> > s = [(x.upper(), x) for x in s]
> >
> > It turns out that the builtin comparison of such tuples does what you
> > want:
> >    cmp ((x1upper, x1),(x2upper,x2)) is cmp(x1upper,x2upper) if this
> >    is nonzero, and is cmp(x1,x2) if both upper-case strings compare
> >    equal
> >
> > So this will sort the list in a way that is case insensitive; if two
> > strings differ only in case, it will sort the strings in a
> > case-sensitive manner.
>
> That looks good, but I'm a bit slow. I'm not seeing how to create just a
> compare function so I can sort in place. This is the closest I could I do
> and it requires a standalone function and the temporary creation of
another
> list:
>
> >>> def caseinsensitive(list):
> ...     list = [(x.upper(), x) for x in list]
> ...     list.sort()
> ...     return [x[1] for x in list]
> ...
> >>> s = [u'ö', u'ä', u'Ä', 'b', 'a', 'B', u'a', 'A']
> >>> s = caseinsensitive(s)
> >>> s
> ['A', 'a', u'a', 'B', 'b', u'\xc4', u'\xe4', u'\xf6']
>
> Is there a simpler way I'm just not seeing one that avoids a lot of extra
> upper() calls? The list creation overhead may be worse than extra upper
> calls. I didn't have any succes trying to incorporate locale.strcoll or
> locale.strxfrm. I kept getting an exception
>
> UnicodeError: ASCII encoding error: ordinal not in range(128)
>
> So, some other string encoding transformation must be necessary before
> list.sort() is called in the function? Those functions were new to me, but
> they sound like what I want since locale-aware sorting would be a good
> addition.
>
> > > Now the next point is that it would be nice to be able to get a
> > > case-insensitive sort, which seems to be the most likely thing you
want
> to
> > > do when sorting strings. If an optional casesensitive arg was added to
> > > sort() then without breaking any old code you could do:
> > >
> > > s.sort(casesensitive=False)
> >
> > This won't fly. What if you are not sorting a list of strings, but, say,
> > a list of numbers?
>
> I retract the suggestion since it is the default cmp when applied to
strings
> that I take issue with.
>
> > > The final point is that the solution above still doesn't handle the
> > > characters a umlaut and A umlaut correctly.
> >
> > Define "correctly". Different languages sort accented characters in
> > different ways: either put them after all other letters (Swedish, I
> > believe, and French for their accented characters), or sort them
> > immediately after the unaccented form (German DIN sorting), or
> > sort them as-if transliterated (ie. sort Löwis as Loewis, old German
> > phonebook sorting). Since we have two sorting orders in Germany for
> > accented characters, I always forget how they work, and have to look in
> > the dictionary to see what sorting procedure it uses.
> >
> > In any case, if you want to use locale-aware sorting, you need to use
> > the locale.strcoll function (use setlocale before using that function).
> > Since strcoll is expensive, you can use locale.strxfrm on each string
> > to transform it into a byte string for which lexicographical sorting
> > honors collation order.
> >
> > > Anyone up for a casesensitive flag addition to the sort() method?
> >
> > Please, no.
> >
> > Notice that strcoll and strxfrm aren't the most recent approach to
> > string collation: If you use Unicode and the IBM ICU library, they'll
> > offer extensive collation support where you can chose between various
> > collation conventions at run-time.
> >
> > Regards,
> > Martin
>
> Thanks Martin,
>
> ka
>
>