[Tutor] Best approach to sort data based on several criteria

Jorge Godoy godoy at metalab.unc.edu
Thu Aug 14 12:08:14 EDT 2003


NOTE: I already solved that problem by using an RDBMS system, as
      suggested and acknowledged by me in prior posts, but I really
      think this discussion is useful for other programs/future
      use. :-) 

      I am very thankful for your answers.

Magnus Lyckå <magnus at thinkware.se> writes:

> The fastest approach in Python is typically to transform your data
> into a list which can be sorted directly with the list.sort()
> method.

Hmmmm... This is what I had in mind initially. The problem was with
the composite index scheme. 

> You can supply a function to the sort method call, but that makes
> sorting much slower. For big lists with high performance
> requirements, it's typically faster to create a new list that can be
> sorted as is.

Here it was my problem: performance. I needed it relatively fast. 

I hadn't, on the other hand, thought about creating 'sublists'.

Something like sorting a list accordingly to the first key, then
splitting each different 'first item' (the first key) on a new list
and sorting such a new list accordingly to the second key and
repeating the process 'til all data was sorted out would require too
much memory, wouldn't it? 

How could I do that and go easy on memory at the same time? 

> This is sometimes called a Schwartzian transform, after Perl guru
> Randal Schwartz. What you do in Python is to change a list of rows
> to a list of tuples: (sort_criteria, row), then sort it, and finally
> transform it back to the list of rows without the preceding sort
> criteria. This will also work with dates if they are given as
> objects of some date type/class such as mx.DateTime or the new
> datetime class in 2.3, or in a sane string format, such as ISO 8601.
>
> It could look something like this:
>
>  >>> l = [('a', 2),
>           ('c', 1),
>           ('b', 3)]
>  >>> col = 0
>  >>> for i, row in enumerate(l):
>          l[i] = (row[col], row)
>
>  >>> l.sort()
>  >>> for i, row in enumerate(l):
>          l[i] = row[1]
>
>  >>> l
> [('a', 2), ('b', 3), ('c', 1)]
>  >>> col = 1
>  >>> for i, row in enumerate(l):
>          l[i] = (row[col], row)
>
>  >>> l.sort()
>  >>> for i, row in enumerate(l):
>          l[i] = row[1]
>
>  >>> l
> [('c', 1), ('a', 2), ('b', 3)]

This is "easy" (you have to think this first, but after the idea the
code flows easily) for a two collumn list or for a little number of
columns. How would you handle that easily with, e.g., 5 sorting keys
working with 22 columns and several megabytes of data (the actual size
of the base is bigger than one hundred megabytes and this is another
reaso to use an RDBMS --- but make it something like 50K records just
to have an idea)? 


Just to let you know, the old 'system' used to use several
spreadsheets with weekly data on it and they had to open some of them
to make a useful analisys. With the database system, everything is
there and can be recovered in a much easier and effective way. 


See you,
-- 
Godoy.     <godoy at metalab.unc.edu>



More information about the Tutor mailing list