Complex sort on big files

Tue Aug 2 06:25:05 EDT 2011

Hi Dan,

Thanks for the reply.

On Mon, Aug 1, 2011 at 5:45 PM, Dan Stromberg <drsalists at gmail.com> wrote:
>
> Python 2.x, or Python 3.x?

Currently Python 2.x.

> What are the types of your sort keys?

Both numbers and strings.

> If you're on 3.x and the key you need reversed is numeric, you can negate
> the key.

I did wonder about that. Would that not be doable also in Python 2.7,
using sorted(key=...)?

> If you're on 2.x, you can use an object with a __cmp__ method to compare
> objects however you require.

OK, right.

Looking at the HowTo/Sorting again [1] and the bit about cmp_to_key,
could you also achieve the same effect by returning a key with custom
implementations of rich comparison functions?

> You probably should timsort the chunks (which is the standard list_.sort() -
> it's a very good in-memory sort), and then merge them afterward using the
> merge step of merge sort.

Yes, that's what I understood by the activestate recipe [2].

So I guess my question boils down to, how do you do the merge step for
a complex sort? (Assuming each chunk had been completely sorted
first.)

Maybe the answer is also to construct a key with custom implementation
of rich comparisons?

Now I'm also wondering about the best way to sort each chunk. The
examples in [1] of complex sorts suggest the best way to do it is to
first sort by the secondary key, then sort by the primary key, relying
on the stability of the sort to get the desired outcome. But would it
not be better to call sorted() once, supplying a custom key function?

(As an aside, at the end of the section in [1] on Sort Stability and
Complex sorts, it says "The Timsort algorithm used in Python does
multiple sorts efficiently because it can take advantage of any
ordering already present in a dataset." - I get that that's true, but
I don't see how that's relevant to this strategy for doing complex
sorts. I.e., if you sort first by the secondary key, you don't get any
ordering that's helpful when you subsequently sort by the primary key.
...?)

(Sorry, another side question, I'm guessing reading a chunk of data
into a list and using Timsort, i.e., calling list.sort() or
sorted(mylist), is quicker than using bisect to keep the chunk sorted
as you build it?)

> heapq's not unreasonable for the merging, but I think it's more common to
> use a short list.

Do you mean a regular Python list, and calling min()?

> I have a bunch of Python sorts at
> http://stromberg.dnsalias.org/svn/sorts/compare/trunk/ - if you find your
> need is specialized (EG, 3.x sorting by a secondary key that's a string, in
> reverse), you could adapt one of these to do what you need.

Thanks. Re 3.x sorting by a secondary key that's a string, in reverse,
which one were you thinking of in particular?

> heapq is not bad, but if you find you need a logn datastructure, you might
> check out http://stromberg.dnsalias.org/~dstromberg/treap/ - a treap is also
> logn, but has a very good amortized cost because it sacrifices keeping
> things perfectly balanced (and rebalancing, and rebalancing...) to gain
> speed.  But still, you might be better off with a short list and min.

Thanks, that's really helpful.

Cheers,

Alistair

[1] http://wiki.python.org/moin/HowTo/Sorting/
[2] http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/

> On Mon, Aug 1, 2011 at 8:33 AM, aliman <alimanfoo at googlemail.com> wrote:
>>
>> Hi all,
>>
>> Apologies I'm sure this has been asked many times, but I'm trying to
>> figure out the most efficient way to do a complex sort on very large
>> files.
>>
>> I've read the recipe at [1] and understand that the way to sort a
>> large file is to break it into chunks, sort each chunk and write
>> sorted chunks to disk, then use heapq.merge to combine the chunks as
>> you read them.
>>
>> What I'm having trouble figuring out is what to do when I want to sort
>> by one key ascending then another key descending (a "complex sort").
>>
>> I understand that sorts are stable, so I could just repeat the whole
>> sort process once for each key in turn, but that would involve going
>> to and from disk once for each step in the sort, and I'm wondering if
>> there is a better way.
>>
>> I also thought you could apply the complex sort to each chunk before
>> writing it to disk, so each chunk was completely sorted, but then the
>> heapq.merge wouldn't work properly, because afaik you can only give it
>> one key.
>>
>> Any help much appreciated (I may well be missing something glaringly
>> obvious).
>>
>> Cheers,
>>
>> Alistair
>>
>> [1]
>> http://code.activestate.com/recipes/576755-sorting-big-files-the-python-26-way/
>> --
>> http://mail.python.org/mailman/listinfo/python-list
>
>

-- 
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Web: http://purl.org/net/aliman
Email: alimanfoo at gmail.com
Tel: +44 (0)1865 287669