Complex sort on big files

Tue Aug 2 13:07:44 EDT 2011

On Tue, Aug 2, 2011 at 3:25 AM, Alistair Miles <alimanfoo at googlemail.com>wrote:

> Hi Dan,
>
> Thanks for the reply.
>
> On Mon, Aug 1, 2011 at 5:45 PM, Dan Stromberg <drsalists at gmail.com> wrote:
> >
> > Python 2.x, or Python 3.x?
>
> Currently Python 2.x.
>

So it sounds like you may want to move this code to 3.x in the future.

> > What are the types of your sort keys?
>
> Both numbers and strings.
>
> > If you're on 3.x and the key you need reversed is numeric, you can negate
> > the key.
>
> I did wonder about that. Would that not be doable also in Python 2.7,
> using sorted(key=...)?
>

Yes, probably, though in 2.x it's easy to just use __cmp__.

> > If you're on 2.x, you can use an object with a __cmp__ method to compare
> > objects however you require.
>
> OK, right.
>
> Looking at the HowTo/Sorting again [1] and the bit about cmp_to_key,
> could you also achieve the same effect by returning a key with custom
> implementations of rich comparison functions?
>

Yes, I believe so,. though then you're back to a bit slower sorting.

> > You probably should timsort the chunks (which is the standard
> list_.sort() -
> > it's a very good in-memory sort), and then merge them afterward using the
> > merge step of merge sort.
>
> Yes, that's what I understood by the activestate recipe [2].
>
> So I guess my question boils down to, how do you do the merge step for
> a complex sort? (Assuming each chunk had been completely sorted
> first.)
>

Open one file for each sorted list on disk.  Save these files in a
file_list, indexed by an integer (of course, since this is a list) - doing
this means you have a mapping from an integer to a file.

Read the first value from each such file.  Note that one of these values
will become the least value in the result list - the least value in the
result file has to be the least value from one of the sorted sublists.

Fill some datastructure (list, heap, treap, etc.) with tuples
(value_from_sorted_list1, index_into_file_dict1), (value_from_sorted_list2,
index_into_file_dict2), etc.  Then pull  out  the least value from the
list/heap/treap, and add it to the result file,  and read another value from
the _same_ file (which you can get back from your file_list) you got back
and add a replacement tuple to your list/heap/treap.

In this way, each step of the way, when merging n files, your
list/heap/treap always has n values - until you reach the end of one or more
of your sorted files.  Then you always have less than n.

When you hit EOF on some file, you just don't add anything back for that
file.  Resist the temptation to remove it from your file_list - that'll mess
up the file indexes in the list/heap/treap.

Continue as long as you have an open file.

We're using a tuple instead of a class, because comparing tuples tends to be
pretty fast.  We're using file indexes instead of files, because I don't
want to think about what happens when you compare two files with the <
operator.  :)

Maybe the answer is also to construct a key with custom implementation
> of rich comparisons?
>

Could be related.

> Now I'm also wondering about the best way to sort each chunk. The
> examples in [1] of complex sorts suggest the best way to do it is to
> first sort by the secondary key, then sort by the primary key, relying
> on the stability of the sort to get the desired outcome. But would it
> not be better to call sorted() once, supplying a custom key function?
>

Function calls in CPython are expensive, so if you can sort (twice) using a
key (decorate, sort, undecorate), that's probably a performance win.  It's
kind of a Software Engineering lose though, so using a class with one or
more comparison methods can be a benefit too.

> (As an aside, at the end of the section in [1] on Sort Stability and
> Complex sorts, it says "The Timsort algorithm used in Python does
> multiple sorts efficiently because it can take advantage of any
> ordering already present in a dataset." - I get that that's true, but
> I don't see how that's relevant to this strategy for doing complex
> sorts. I.e., if you sort first by the secondary key, you don't get any
> ordering that's helpful when you subsequently sort by the primary key.
> ...?)
>

Agreed.  Sounds irrelevant.

> (Sorry, another side question, I'm guessing reading a chunk of data
> into a list and using Timsort, i.e., calling list.sort() or
> sorted(mylist), is quicker than using bisect to keep the chunk sorted
> as you build it?)
>

Oh yeah.  The bisection itself  is just O(logn), but inserting into the
correct place afterward is O(n).  So unless n is tiny, you're better off
with a heap or treap or red-black tree.  It's the difference between an
overall O(nlogn) algorithm and an O(n^2) algorithm - for large n, it's a big
difference.

> > heapq's not unreasonable for the merging, but I think it's more common to
> > use a short list.
>
> Do you mean a regular Python list, and calling min()?
>

Yes.  Depending on how many lists you wish to merge at a time.

Note that in terms of file I/O, it may be faster to open as many file
descriptors as you can and use a logn datastructure for the merges.  But
recall that you don't get infinite file descriptors, so there's a benefit to
being able to do them m at a time, for some reasonable m.  Merge sort is
commonly written with m=2 for simplicity.

> I have a bunch of Python sorts at
> > http://stromberg.dnsalias.org/svn/sorts/compare/trunk/ - if you find
> your
> > need is specialized (EG, 3.x sorting by a secondary key that's a string,
> in
> > reverse), you could adapt one of these to do what you need.
>
> Thanks. Re 3.x sorting by a secondary key that's a string, in reverse,
> which one were you thinking of in particular?
>

timsort is great for in-memory sorting, but the version that comes with
CPython is faster than the Cython version at the URL.  But there's a merge
sort at the URL that you could glance at for just the merge step.

But if you find you need some code you can adapt to specific needs, the
timsort at the URL is probably a good option.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20110802/25c6849b/attachment.html>