Complex sort on big files
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Aug 5 23:30:33 EDT 2011
Roy Smith wrote:
> Wow.
>
> I was going to suggest using the unix command-line sort utility via
> popen() or subprocess. My arguments were that it's written in C, has 30
> years of optimizing in it, etc, etc, etc. It almost certainly has to be
> faster than anything you could do in Python.
>
> Then I tried the experiment. I generated a file of 1 million random
> integers in the range 0 to 5000. I wrote a little sorting program:
[...]
> Python took just about half the time. Certainly knocked my socks off.
> Hard to believe, actually.
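For anyone who wants to reproduce the experiment, it presumably looked
something like the sketch below. This is not Roy's actual code: the file
name, the -n flag and the timing harness are all my guesses.

    import subprocess, time

    # External sort: shell out to the Unix sort utility.
    t0 = time.time()
    subprocess.call(['sort', '-n', 'numbers.txt', '-o', 'unix_sorted.txt'])
    print('unix sort: %.2fs' % (time.time() - t0))

    # In-memory sort: slurp the whole file and let timsort do the work.
    t0 = time.time()
    with open('numbers.txt') as f:
        nums = sorted(int(line) for line in f)
    with open('py_sorted.txt', 'w') as f:
        f.writelines('%d\n' % n for n in nums)
    print('timsort:   %.2fs' % (time.time() - t0))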
One million integers isn't very big. If each integer fits in a four-byte
long, that's less than 4MB. That's almost small enough to fit in your CPU's
cache, with room left over for the first few chapters of "War And Peace"
*wink*
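If you don't believe the arithmetic, the array module will confirm it
(assuming the 'i' typecode is four bytes on your platform, which it almost
always is):

    from array import array

    nums = array('i', range(1000000))  # one million 4-byte signed ints
    print(nums.itemsize * len(nums))   # 4000000 bytes, a shade under 4MB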
So you're comparing Python's timsort, which is Awesome with a capital AWE
but only works on data that fits in memory, versus something which can also
work on files too big to fit into memory.
Try generating a twenty gigabyte file of data, and sort that. Actually,
don't, because just *reading it in* to Python will probably fail, and very
possibly lock your PC up for the duration.
Unix sort does an external R-way merge sort: if you have more data than
memory, it slices the data up into a bunch of smaller pieces, each of which
will fit in memory, sorts each one to a temporary file, then merges the
lot. It does a lot more work on such big files, because it *takes* a lot
more work.
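Here's a toy version of the same idea in Python, using heapq.merge to do
the merge phase lazily. A sketch only: it assumes one integer per line,
and the chunk size is made up.

    import heapq, itertools, tempfile

    def external_sort(infile, outfile, chunk_size=1000000):
        # Phase 1: read the input in chunks small enough to fit in
        # memory, sort each chunk with timsort, spill it to a temp file.
        runs = []
        with open(infile) as f:
            while True:
                chunk = list(itertools.islice(f, chunk_size))
                if not chunk:
                    break
                chunk.sort(key=int)
                tmp = tempfile.TemporaryFile('w+')
                tmp.writelines(chunk)
                tmp.seek(0)
                runs.append(tmp)
        # Phase 2: R-way merge of the sorted runs.  heapq.merge is
        # lazy, so only one line per run is in memory at a time.
        with open(outfile, 'w') as out:
            out.writelines(heapq.merge(*runs, key=int))

Call it as external_sort('numbers.txt', 'sorted.txt') and peak memory
stays around one chunk, no matter how big the input file is.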
For something as small as one million numbers, chances are the Unix sort
falls back on a heapsort or a quicksort, which will be pretty fast, but it
ain't no timsort.
So yes, Python's timsort is awesome, but so is Unix's sort, just in
different ways.
--
Steven