[Python-Dev] Sorting

Tim Peters tim.one@comcast.net
Tue, 23 Jul 2002 04:30:11 -0400


[Scott Gilbert]
> I'm curious if there is any literature that you've come across, or if
> you've done any experiments with merging more than two parts at a
> time.

There's a literal mountain of research on the topic.  I recommend

    "A Meticulous Analysis of Mergesort Programs"
    Jyrki Katajainen, Jesper Larsson Traff

for a careful accounting of all operations that go into one of these beasts.
They got the best results (and much better than quicksort) out of a 4-way
bottom-up mergesort via very tedious code (e.g., it effectively represents
which input run currently has the smallest next key via the program counter,
by way of massive code duplication and oodles of gotos); they were afraid to
write actual code for an 8-way version <wink>.  OTOH, they were sorting
random integers, and, e.g., were delighted to increase the # of comparisons
when that could save a few other "dirt cheap" operations.
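
For concreteness, here's a rough Python sketch of one bottom-up 4-way merge
pass plus a driver -- illustration only, nothing like their goto-riddled
code, and the function names are my own:

    def merge4_pass(a, width):
        """Merge each group of four adjacent width-long sorted runs of a."""
        out = []
        for start in range(0, len(a), 4 * width):
            runs = [a[start + i * width : start + (i + 1) * width]
                    for i in range(4)]
            heads = [0, 0, 0, 0]        # next unconsumed index in each run
            while True:
                # straight scan over the (at most 4) run heads for the smallest
                best = -1
                for i, run in enumerate(runs):
                    if heads[i] < len(run) and (best < 0 or
                            run[heads[i]] < runs[best][heads[best]]):
                        best = i
                if best < 0:            # all four runs exhausted
                    break
                out.append(runs[best][heads[best]])
                heads[best] += 1
        return out

    def mergesort4(a):
        """Bottom-up 4-way mergesort: runs of length 1, then 4, 16, ..."""
        a = list(a)
        width = 1
        while width < len(a):
            a = merge4_pass(a, width)
            width *= 4
        return a

The straight scan over the run heads is the part their paper sweats over;
encoding "which run is smallest right now" in the program counter is what all
the duplication and gotos buy them.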

> ...
> My thinking is that many machines (probably yours for instance) have a
> cache that is 4-way associative, so merging only 2 blocks at a time might
> not be using the cache as well as it could.  Also, changing from merging 2
> blocks to 3 or 4 blocks at a time would change the number of passes you
> have to make (the log part of N*log(N)).
>
> It's quite possible that this isn't worth the trade off in complexity and
> space (or your time :-).

The real reason it's uninteresting to me is that it has no clear
applicability to the cases this sort aims at:  exploiting significant
pre-existing order of various kinds.  That leads to unbalanced run lengths
when we're lucky, and if I'm merging a 2-element run with a 100,000-element
run, high cache associativity isn't of much use.  From the timings I showed
before, it's clear that "good cases" of pre-existing order take time that
depends almost entirely on just the number of comparisons needed; e.g.,
3sort and +sort were as fast as /sort, where the latter does nothing but N-1
comparisons in a single left-to-right scan of the array.  Comparisons are
expensive enough in Python that doing O(log N) additional comparisons in
3sort and +sort, then moving massive amounts of the array around to fit the
oddballs in place, costs almost nothing more in percentage terms.  Since
these cases are already effectively as fast as a single left-to-right scan,
there's simply no potential remaining for significant gain (unless you can
speed up a single left-to-right scan!  that would be way cool).
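
If you want to see the "comparisons are everything" effect for yourself, a
wrapper that counts calls to < makes it concrete (the Counted class here is
just my illustration, not anything in sortperf.py):

    class Counted:
        ncmp = 0                          # class-wide count of < calls
        def __init__(self, value):
            self.value = value
        def __lt__(self, other):          # the sort only needs <
            Counted.ncmp += 1
            return self.value < other.value

    data = [Counted(i) for i in range(100000)]   # /sort-like: already ascending
    data.sort()
    print(Counted.ncmp)                   # 99999, i.e. exactly N-1 comparisons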

If you think you can write a sort for random Python arrays faster than the
samplesort hybrid, be my guest:  I'd love to see it!  You should be aware
that I've been making this challenge for years <wink>.

Something to note:  I think you have an overly simple view of Python's lists
in mind.  When we're merging two runs in the timing test, it's not *just*
the list memory that's getting scanned.  The lists contain pointers *to*
float objects.  The float objects have to get read up from memory too, and
there goes the rest of your 4-way associativity.  Indeed, if you read the
comments in Lib/test/sortperf.py, you'll find that it performs horrid
trickery to ensure that ~sort and =sort work on physically distinct float
objects; way back when, these particular tests ran much faster, and that
turned out to be partly because, e.g., [0.5] * N constructs a list with N
pointers to a single float object, and that was much easier on the memory
system.  We got a really nice slowdown <wink> by forcing N distinct copies
of 0.5.  In earlier Pythons the comparison also got short-circuited by an
early pointer-equality test ("if they're the same object, they must be
equal"), but that's not done anymore.  A quick run just now showed that
=sort still runs significantly quicker if given a list of identical objects;
the only explanation left for that appears to be cache effects.
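
The shared-object point is easy to poke at interactively, and whether the
shared version still sorts quicker on your machine is exactly the cache
question (the names here are mine, not sortperf.py's):

    N = 100000
    shared = [0.5] * N                    # N pointers to one float object
    # the runtime arithmetic below defeats constant folding, so each element
    # is a freshly allocated float that merely happens to equal 0.5
    distinct = [i * 0.0 + 0.5 for i in range(N)]

    print(len({id(x) for x in shared}))   # 1
    print(len({id(x) for x in distinct})) # 100000

    from timeit import timeit
    print(timeit(lambda: sorted(shared), number=20))
    print(timeit(lambda: sorted(distinct), number=20))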

> Keeping track of comparisons that you've already made could get ugly,

Most researchers have found that a fancy data structure for this is
counter-productive:  so long as the m in m-way merging isn't ridiculously
large, keeping the head entries in a straight vector with m elements runs
fastest.  But they're not worried about Python's expensive-comparison case.
External sorts using m-way merging with large m typically use a selection
tree much like a heap to reduce the expense of keeping track (see, e.g.,
Knuth for details).
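
For flavor, here's what the selection-tree approach looks like if you let a
binary heap stand in for Knuth's tournament tree:  picking the overall
smallest head costs O(log m) instead of an O(m) scan over a head vector.
This is a sketch, not anything from the stdlib, though heapq.merge in today's
standard library does essentially the same job and returns an iterator:

    import heapq

    def kway_merge(runs):
        """Merge any number of already-sorted sequences into one sorted list."""
        heap = []
        for rid, run in enumerate(runs):
            it = iter(run)
            for first in it:              # pull the first element, if any
                # the run id breaks ties so iterators are never compared
                heapq.heappush(heap, (first, rid, it))
                break
        out = []
        while heap:
            value, rid, it = heapq.heappop(heap)
            out.append(value)
            for nxt in it:                # refill from the run just drawn from
                heapq.heappush(heap, (nxt, rid, it))
                break
        return out

    print(kway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9], []]))
    # [1, 2, 3, 4, 5, 6, 7, 8, 9]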

> and your temp space requirement would go from N/2 to possibly 3N/4...
> But since you're diving so deeply into this problem, I figured I'd
> throw it out there.
>
> OTOH, this could be close to the speedup that heavily optimized FFT algs
> get when they go from radix-2 to radix-4.   Just thinking out loud...

I don't think that's comparable.  Moving to radix 4 cuts the total number of
non-trivial complex multiplies an FFT has to do, and non-trivial complex
multiplies are the expensive part of what an FFT does.  In contrast,
boosting the m in m-way merging doesn't cut the number of comparisons needed
at all (to the contrary, if you're not very careful it increases them), and
comparisons are what kill sorting routines in Python.  The elaborate
gimmicks in timsort for doing merges of unbalanced runs do cut the total
number of comparisons needed, and that's where the huge wins come from.
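
To make that last point concrete, here's a toy version of the
unbalanced-merge idea -- only the flavor of the galloping gimmick, not the
real thing, and merge_tiny_into_big is my own name for it.  Binary-searching
the insertion points for a k-element run in an n-element run costs about
k*lg(n) comparisons, where a plain pairwise merge could burn close to n+k:

    import bisect

    def merge_tiny_into_big(big, tiny):
        """big and tiny are sorted lists; tiny is assumed to be much shorter."""
        out = []
        start = 0
        for x in tiny:
            # ~lg(len(big) - start) comparisons to find where x belongs
            pos = bisect.bisect_left(big, x, start)
            out.extend(big[start:pos])    # bulk copy, no comparisons at all
            out.append(x)
            start = pos
        out.extend(big[start:])
        return out

    print(merge_tiny_into_big(list(range(0, 100, 10)), [5, 55]))
    # [0, 5, 10, 20, 30, 40, 50, 55, 60, 70, 80, 90]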