Complex sort on big files

Steven D'Aprano steve+comp.lang.python at
Sat Aug 6 05:30:33 CEST 2011

Roy Smith wrote:

> Wow.
> I was going to suggest using the unix command-line sort utility via
> popen() or subprocess.  My arguments were that it's written in C, has 30
> years of optimizing in it, etc, etc, etc.  It almost certainly has to be
> faster than anything you could do in Python.
> Then I tried the experiment.  I generated a file of 1 million random
> integers in the range 0 to 5000.  I wrote a little sorting program:
> Python took just about half the time.  Certainly knocked my socks off.
> Hard to believe, actually.

One million integers isn't very big. If each integer can fit in four-byte
long, that's less than 4MB. That's almost small enough to fit in your CPU's
cache, with room left over for the first few chapters of "War And Peace"

So you're comparing Python's timsort, which is Awesome with a capital AWE
but only works on data that fits in memory, versus something which can also
work on files too big to fit into memory.

Try generating a twenty gigabyte file of data, and sort that. Actually,
don't, because just *reading it in* to Python will probably fail, and very
possibly lock your PC up for the duration.

Unix sort does an external R-Way merge sort: if you have more data than
memory, it slices the data up into a bunch of smaller pieces, each of which
will fit in memory, sorts each one to a temporary file, then merges the
lot. It does a lot more work on such big files, because it *takes* a lot
more work.

For something as small as one million numbers, chances are the Unix sort
falls back on a heapsort or a quicksort, which will be pretty fast, but it
ain't no timsort.

So yes, Python's timsort is awesome, but so is Unix's sort, just in
different ways.


More information about the Python-list mailing list