Sorting strings.

Christian Tismer tismer at tismer.com
Thu Feb 8 06:43:48 EST 2001


Gaute B Strokkenes wrote:
> 
> Hello!
> 
> One of my current pet projects is a python script that parses and
> sorts a library catalogue.  The catalogue is stored as a plain text
> file, and all that needs doing is to read in the author, title,
> etc. of each book, with some cleverness to detect multi-line entries,
> comments etc.
> 
> After all this is done, it is time to sort the entries in appropriate
> order before we print them out again.  Currently I do this:
> 
> def entry_cmp(a, b):
>     if a.author < b.author:
>         return -1
>     if a.author > b.author:
>         return 1
>     if a.title < b.title:
>         return -1
>     if a.title > b.title:
>         return 1
>     if a.prefix < b.prefix:
>         return -1
>     if a.prefix > b.prefix:
>         return 1
>     if a.code < b.code:
>         return -1
>     if a.code > b.code:
>         return 1
>     return 0
> 
> ...
> 
> ll.sort(entry_cmp)
> 
> The problem with this approach is that string comparison doesn't quite
> have the behaviour that I would like.  For instance, I'd like numbers
> to come after letters, and I'd like to be case insensitive.  Is there
> a reasonably clean way to do this?

There are several ways. As someone else pointed out, you can of
course make your cmp function as sophisticated as necessary.

Slightly less powerful, but reasonably faster is to calculate
a suitable key field (which might be a tuple) and do the
sorting on that, without supplying an extra cmp function.

Whatever you choose, you need to process your data a little,
either during comparison, or in preparation of the whole
sort.

The following seems quite natural to me:
For case insensitive sorting, use string.upper() to build
a case insensitive key. For proper sorting by numbers,
parse the numbers off from your string, and turn them
into real integer numbers.

You want numbers to come after letters.
This is a little uncomfortable, since comparison by
default assumes that strings are bigger than numbers.
In a cmp() implementation, you can address this directly.
But if you use the key pre-generation method, you need
to modify your key strings a little to make that work as
expected.

Here a function that parses one of your strings and splits
it into a (string, number, string) tuple:

import string, re

rex = re.compile("([^0-9]*)([0-9]*)(.*)")

def create_key(s):
  s = string.upper(s)
  s1, num, s2 = rex.match(s).groups()
  try:
    num = int(num)
    s1 = s1 + "x"
  except:
    num = -1
  return s1, num, s2

This function takes a string and creates a tuple for you
which contains a number matched inside your string, or
-1 if there is none. The sorting order is case independent,
and numbers go after strings. The latter is accomplished
by appending "x" to s1, if we found a number.

Example:

>>> x=map(create_key, ("ab", "ab1", "abc"))
>>> x.sort()
>>> x
[('AB', -1, ''), ('ABC', -1, ''), ('ABx', 1, '')]

Some mapping like this should be done for all of your
keys to sort, in the right order. Then you put all these
keys together to one tuple, and you attach your real
data to the end.

Example, just for data like (author, title):

def sortit(data_list):
  sorter = []
  for author, title in data_list:
  sorter.append( (create_key(author), create_key(title), (author, title)) )
  #                     # here a key     # there a key   # and here the data
  sorter.sort()
  result = []
  for record in sorter:
    result.append(record[-1]) # just the data
  return result

hope this helps - ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at tismer.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     where do you want to jump today?   http://www.stackless.com




More information about the Python-list mailing list