Complex sort on big files

Dan Stromberg drsalists at gmail.com
Mon Aug 1 18:45:54 CEST 2011


Python 2.x, or Python 3.x?

What are the types of your sort keys?

If you're on 3.x and the key you need reversed is numeric, you can negate
the key.

If you're on 2.x, you can use an object with a __cmp__ method to compare
objects however you require.

You probably should timsort the chunks (which is the standard list_.sort() -
it's a very good in-memory sort), and then merge them afterward using the
merge step of merge sort.

heapq's not unreasonable for the merging, but I think it's more common to
use a short list.

I have a bunch of Python sorts at
http://stromberg.dnsalias.org/svn/sorts/compare/trunk/ - if you find your
need is specialized (EG, 3.x sorting by a secondary key that's a string, in
reverse), you could adapt one of these to do what you need.

heapq is not bad, but if you find you need a logn datastructure, you might
check out http://stromberg.dnsalias.org/~dstromberg/treap/ - a treap is also
logn, but has a very good amortized cost because it sacrifices keeping
things perfectly balanced (and rebalancing, and rebalancing...) to gain
speed.  But still, you might be better off with a short list and min.

On Mon, Aug 1, 2011 at 8:33 AM, aliman <alimanfoo at googlemail.com> wrote:

> Hi all,
>
> Apologies I'm sure this has been asked many times, but I'm trying to
> figure out the most efficient way to do a complex sort on very large
> files.
>
> I've read the recipe at [1] and understand that the way to sort a
> large file is to break it into chunks, sort each chunk and write
> sorted chunks to disk, then use heapq.merge to combine the chunks as
> you read them.
>
> What I'm having trouble figuring out is what to do when I want to sort
> by one key ascending then another key descending (a "complex sort").
>
> I understand that sorts are stable, so I could just repeat the whole
> sort process once for each key in turn, but that would involve going
> to and from disk once for each step in the sort, and I'm wondering if
> there is a better way.
>
> I also thought you could apply the complex sort to each chunk before
> writing it to disk, so each chunk was completely sorted, but then the
> heapq.merge wouldn't work properly, because afaik you can only give it
> one key.
>
> Any help much appreciated (I may well be missing something glaringly
> obvious).
>
> Cheers,
>
> Alistair
>
> [1]
> http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/
> --
> http://mail.python.org/mailman/listinfo/python-list
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20110801/2b78eb66/attachment.html>


More information about the Python-list mailing list