[Python-Dev] Sorting

Tim Peters tim.one@comcast.net
Sat, 27 Jul 2002 16:22:47 -0400

> ...
> I also noted that msort() gets a 32% speedup on my box when sorting a
> 1.33-million line snapshot of the Python-Dev archive.  This is a puzzler
> to account for, since you wouldn't think there's significant pre-existing
> lexicographic order in a file like that.  McIlroy noted similar results
> from experiments on text, PostScript and C source files in his adaptive
> mergesort (which is why I tried sorting Python-Dev to begin with), but
> didn't offer a hypothesis.

Just a note to clarify what "the puzzle" here is.  msort() may or may not be
faster than sort() on random input on a given box due to platform quirks,
but that isn't relevant in this case.  What McIlroy noted is that the total
# of compares done in these cases was significantly less than log2(N!).
That just can't happen (except with vanishingly small probability) if the
input data is randomly ordered, and under any comparison-based sorting

The only hypothesis I have is that, for a stable sort, all the instances of
a given element are, by definition of stability, already "in sorted order".
So, e.g., "\n" is a popular line in text files, and all the occurrences of
"\n" are already sorted.  msort can exploit that -- and seemingly does.
This doesn't necessarily contradict that ~sort happens to run slower on my
box under msort, because ~sort is such an extreme case.

OK, if I remove all but the first occurrence of each unique line, the # of
lines drops to about 600,000.  The speedup msort enjoys also drops, to 6.8%.
So exploiting duplicates does appear to account for the bulk of it, but not
all of it.

If, after removing duplicates, I run that through random.shuffle() before
sorting, msort suffers an 8% slowdown(!) relative to samplesort.

If I shuffle first but don't remove duplicates, msort enjoys a 10% speedup.

So it's clear that msort is getting a significant advantage out of the
duplicates, but it's not at all clear what else it's exploiting -- only that
there is something else, and that it's significant.  Now many times has
someone posted an alphabetical list of Python's keywords <wink>?