[Python-Dev] Sorting
Tim Peters
tim.one@comcast.net
Sat, 27 Jul 2002 16:22:47 -0400
[Tim]
> ...
> I also noted that msort() gets a 32% speedup on my box when sorting a
> 1.33-million line snapshot of the Python-Dev archive. This is a puzzler
> to account for, since you wouldn't think there's significant pre-existing
> lexicographic order in a file like that. McIlroy noted similar results
> from experiments on text, PostScript and C source files in his adaptive
> mergesort (which is why I tried sorting Python-Dev to begin with), but
> didn't offer a hypothesis.
Just a note to clarify what "the puzzle" here is. msort() may or may not be
faster than sort() on random input on a given box due to platform quirks,
but that isn't relevant in this case. What McIlroy noted is that the total
# of compares done in these cases was significantly less than log2(N!).
That just can't happen (except with vanishingly small probability) if the
input data is randomly ordered, and under any comparison-based sorting
method.
The only hypothesis I have is that, for a stable sort, all the instances of
a given element are, by definition of stability, already "in sorted order".
So, e.g., "\n" is a popular line in text files, and all the occurrences of
"\n" are already sorted. msort can exploit that -- and seemingly does.
This doesn't necessarily contradict that ~sort happens to run slower on my
box under msort, because ~sort is such an extreme case.
OK, if I remove all but the first occurrence of each unique line, the # of
lines drops to about 600,000. The speedup msort enjoys also drops, to 6.8%.
So exploiting duplicates does appear to account for the bulk of it, but not
all of it.
If, after removing duplicates, I run that through random.shuffle() before
sorting, msort suffers an 8% slowdown(!) relative to samplesort.
If I shuffle first but don't remove duplicates, msort enjoys a 10% speedup.
So it's clear that msort is getting a significant advantage out of the
duplicates, but it's not at all clear what else it's exploiting -- only that
there is something else, and that it's significant. Now many times has
someone posted an alphabetical list of Python's keywords <wink>?