Complex sort on big files
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Aug 5 23:30:33 EDT 2011
Roy Smith wrote:
> Wow.
>
> I was going to suggest using the unix command-line sort utility via
> popen() or subprocess. My arguments were that it's written in C, has 30
> years of optimizing in it, etc, etc, etc. It almost certainly has to be
> faster than anything you could do in Python.
>
> Then I tried the experiment. I generated a file of 1 million random
> integers in the range 0 to 5000. I wrote a little sorting program:
[...]
> Python took just about half the time. Certainly knocked my socks off.
> Hard to believe, actually.
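For anyone who wants to reproduce the experiment, it presumably looked
something like the sketch below. This is not Roy's actual code: the file
name, the -n flag and the timing harness are all my guesses.

    import subprocess, time

    # External sort: shell out to the Unix sort utility.
    t0 = time.time()
    subprocess.call(['sort', '-n', 'numbers.txt', '-o', 'unix_sorted.txt'])
    print('unix sort: %.2fs' % (time.time() - t0))

    # In-memory sort: slurp the whole file and let timsort do the work.
    t0 = time.time()
    with open('numbers.txt') as f:
        nums = sorted(int(line) for line in f)
    with open('py_sorted.txt', 'w') as f:
        f.writelines('%d\n' % n for n in nums)
    print('timsort:   %.2fs' % (time.time() - t0))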
One million integers isn't very big. If each integer fits in a four-byte
long, that's less than 4MB. That's almost small enough to fit in your CPU's
cache, with room left over for the first few chapters of "War And Peace"
*wink*
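If you don't believe the arithmetic, the array module will confirm it
(assuming the 'i' typecode is four bytes on your platform, which it almost
always is):

    from array import array

    nums = array('i', range(1000000))  # one million 4-byte signed ints
    print(nums.itemsize * len(nums))   # 4000000 bytes, a shade under 4MB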
So you're comparing Python's timsort, which is Awesome with a capital AWE
but only works on data that fits in memory, versus something which can also
work on files too big to fit into memory.
Try generating a twenty gigabyte file of data, and sort that. Actually,
don't, because just *reading it in* to Python will probably fail, and very
possibly lock your PC up for the duration.
Unix sort does an external R-way merge sort: if you have more data than
memory, it slices the data up into a bunch of smaller pieces, each of which
will fit in memory, sorts each one to a temporary file, then merges the
lot. It does a lot more work on such big files, because it *takes* a lot
more work.
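Here's a toy version of the same idea in Python, using heapq.merge to do
the merge phase lazily. A sketch only: it assumes one integer per line,
and the chunk size is made up.

    import heapq, itertools, tempfile

    def external_sort(infile, outfile, chunk_size=1000000):
        # Phase 1: read the input in chunks small enough to fit in
        # memory, sort each chunk with timsort, spill it to a temp file.
        runs = []
        with open(infile) as f:
            while True:
                chunk = list(itertools.islice(f, chunk_size))
                if not chunk:
                    break
                chunk.sort(key=int)
                tmp = tempfile.TemporaryFile('w+')
                tmp.writelines(chunk)
                tmp.seek(0)
                runs.append(tmp)
        # Phase 2: R-way merge of the sorted runs.  heapq.merge is
        # lazy, so only one line per run is in memory at a time.
        with open(outfile, 'w') as out:
            out.writelines(heapq.merge(*runs, key=int))

Call it as external_sort('numbers.txt', 'sorted.txt') and peak memory
stays around one chunk, no matter how big the input file is.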
For something as small as one million numbers, chances are the Unix sort
falls back on a heapsort or a quicksort, which will be pretty fast, but it
ain't no timsort.
So yes, Python's timsort is awesome, but so is Unix's sort, just in
different ways.
--
Steven