<br><div class="gmail_quote">On Tue, Aug 2, 2011 at 3:25 AM, Alistair Miles <span dir="ltr"><<a href="mailto:alimanfoo@googlemail.com">alimanfoo@googlemail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Hi Dan,<br>

<br>

Thanks for the reply.<br>

<div class="im"><br>

On Mon, Aug 1, 2011 at 5:45 PM, Dan Stromberg <<a href="mailto:drsalists@gmail.com">drsalists@gmail.com</a>> wrote:<br>

><br>

> Python 2.x, or Python 3.x?<br>

<br>

</div>Currently Python 2.x.<br><div class="im"></div></blockquote><div><br>So it sounds like you may want to move this code to 3.x in the future.<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im">

> What are the types of your sort keys?<br>

<br>

</div>Both numbers and strings.<br>

<div class="im"><br>

> If you're on 3.x and the key you need reversed is numeric, you can negate<br>

> the key.<br>

<br>

</div>I did wonder about that. Would that not be doable also in Python 2.7,<br>

using sorted(key=...)?<br><div class="im"></div></blockquote><div><br>Yes, probably, though in 2.x it's easy to just use __cmp__.<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im">

> If you're on 2.x, you can use an object with a __cmp__ method to compare<br>

> objects however you require.<br>

<br>

</div>OK, right.<br>

<br>

Looking at the HowTo/Sorting again [1] and the bit about cmp_to_key,<br>

could you also achieve the same effect by returning a key with custom<br>

implementations of rich comparison functions?<br><div class="im"></div></blockquote><div><br>Yes, I believe so,. though then you're back to a bit slower sorting.<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<div class="im">

> You probably should timsort the chunks (which is the standard list_.sort() -<br>

> it's a very good in-memory sort), and then merge them afterward using the<br>

> merge step of merge sort.<br>

<br>

</div>Yes, that's what I understood by the activestate recipe [2].<br>

<br>

So I guess my question boils down to, how do you do the merge step for<br>

a complex sort? (Assuming each chunk had been completely sorted<br>

first.)<br></blockquote><div><br>Open one file for each sorted list on disk.  Save these files in a file_list, indexed by an integer (of course, since this is a list) - doing this means you have a mapping from an integer to a file.<br>

<br>Read the first value from each such file.  Note that one of these values will become the least value in the result list - the least value in the result file has to be the least value from one of the sorted sublists.<br>

<br>Fill some datastructure (list, heap, treap, etc.) with tuples (value_from_sorted_list1, index_into_file_dict1), (value_from_sorted_list2, index_into_file_dict2), etc.  Then pull  out  the least value from the list/heap/treap, and add it to the result file,  and read another value from the _same_ file (which you can get back from your file_list) you got back and add a replacement tuple to your list/heap/treap.<br>

<br>In this way, each step of the way, when merging n files, your list/heap/treap always has n values - until you reach the end of one or more of your sorted files.  Then you always have less than n.<br><br>When you hit EOF on some file, you just don't add anything back for that file.  Resist the temptation to remove it from your file_list - that'll mess up the file indexes in the list/heap/treap.<br>

<br>Continue as long as you have an open file.<br><br>We're using a tuple instead of a class, because comparing tuples tends to be pretty fast.  We're using file indexes instead of files, because I don't want to think about what happens when you compare two files with the < operator.  :)<br>

<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Maybe the answer is also to construct a key with custom implementation<br>

of rich comparisons?<br></blockquote><div><br>Could be related.<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

Now I'm also wondering about the best way to sort each chunk. The<br>

examples in [1] of complex sorts suggest the best way to do it is to<br>

first sort by the secondary key, then sort by the primary key, relying<br>

on the stability of the sort to get the desired outcome. But would it<br>

not be better to call sorted() once, supplying a custom key function?<br></blockquote><div><br>Function calls in CPython are expensive, so if you can sort (twice) using a key (decorate, sort, undecorate), that's probably a performance win.  It's kind of a Software Engineering lose though, so using a class with one or more comparison methods can be a benefit too.<br>

 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

(As an aside, at the end of the section in [1] on Sort Stability and<br>

Complex sorts, it says "The Timsort algorithm used in Python does<br>

multiple sorts efficiently because it can take advantage of any<br>

ordering already present in a dataset." - I get that that's true, but<br>

I don't see how that's relevant to this strategy for doing complex<br>

sorts. I.e., if you sort first by the secondary key, you don't get any<br>

ordering that's helpful when you subsequently sort by the primary key.<br>

...?)<br></blockquote><div><br>Agreed.  Sounds irrelevant.<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

(Sorry, another side question, I'm guessing reading a chunk of data<br>

into a list and using Timsort, i.e., calling list.sort() or<br>

sorted(mylist), is quicker than using bisect to keep the chunk sorted<br>

as you build it?)<br><div class="im"></div></blockquote><div><br>Oh yeah.  The bisection itself  is just O(logn), but inserting into the correct place afterward is O(n).  So unless n is tiny, you're better off with a heap or treap or red-black tree.  It's the difference between an overall O(nlogn) algorithm and an O(n^2) algorithm - for large n, it's a big difference.<br>

 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="im">

> heapq's not unreasonable for the merging, but I think it's more common to<br>

> use a short list.<br>

<br>

</div>Do you mean a regular Python list, and calling min()?<br><div class="im"></div></blockquote><div><br>Yes.  Depending on how many lists you wish to merge at a time.<br> <br>Note that in terms of file I/O, it may be faster to open as many file descriptors as you can and use a logn datastructure for the merges.  But recall that you don't get infinite file descriptors, so there's a benefit to being able to do them m at a time, for some reasonable m.  Merge sort is commonly written with m=2 for simplicity.<br>

<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="im">

> I have a bunch of Python sorts at<br>

> <a href="http://stromberg.dnsalias.org/svn/sorts/compare/trunk/" target="_blank">http://stromberg.dnsalias.org/svn/sorts/compare/trunk/</a> - if you find your<br>

> need is specialized (EG, 3.x sorting by a secondary key that's a string, in<br>

> reverse), you could adapt one of these to do what you need.<br>

<br>

</div>Thanks. Re 3.x sorting by a secondary key that's a string, in reverse,<br>

which one were you thinking of in particular?<br><div class="im"></div></blockquote><div><br>timsort is great for in-memory sorting, but the version that comes with CPython is faster than the Cython version at the URL.  But there's a merge sort at the URL that you could glance at for just the merge step.<br>

 <br>But if you find you need some code you can adapt to specific needs, the timsort at the URL is probably a good option.<br><br></div><br></div>