[Python-Dev] Sorting

Tim Peters tim.one@comcast.net
Sat, 20 Jul 2002 02:18:16 -0400


An enormous amount of research has been done on sorting since the last time
I wrote a sort for Python.  Major developments have been in two areas:

1. Adaptive sorting.  Sorting algorithms are usually tested on random
   data, but in real life you almost never see random data.  Python's
   sort tries to catch some common cases of near-order via special-
   casing.  The literature has since defined more than 15 formal
   measures of disorder, and developed algorithms provably optimal in
   the face of one or more of them.  But this is O() optimality, and
   theoreticians aren't much concerned about how big the constant
   factor is.  Some researchers are up front about this, and toward
   the end of one paper with "practical" in its title, the author was
   overjoyed to report that an implementation was only twice as slow
   as a naive quicksort <wink>.

2. Pushing the worst-case number of comparisons closer to the
   information-theoretic limit (ceiling(log2(N!))).

I don't care much about #2 -- in experiments conducted when it was new, I
measured the # of comparisons our samplesort hybrid did on random inputs,
and it was never more than 2% over the theoretical lower bound, and
typically closer.  As N grows large, the expected case provably converges to
the theoretical lower bound.  There remains a vanishingly small chance for a
bad case, but nobody has reported one, and at the time I gave up trying to
construct one.
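
For concreteness, that lower bound is easy to compute; here's a small
sketch (illustration only, not part of any timing harness):

    import math

    def comparison_lower_bound(n):
        # ceiling(log2(n!)), computed as a sum of logs so the huge
        # integer n! is never actually built
        return int(math.ceil(sum(math.log2(k) for k in range(2, n + 1))))

    for i in range(15, 21):
        print(2**i, comparison_lower_bound(2**i))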


Back on Earth, among Python users the most frequent complaint I've heard is
that list.sort() isn't stable.  Alex is always quick to trot out the
appropriate DSU (Decorate Sort Undecorate) pattern then, but the extra
memory burden for that can be major (a new 2-tuple per list element costs
about 32 bytes, then 4 more bytes for a pointer to it in a list, and 12 more
bytes that don't go away to hold each non-small index).
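
For reference, a minimal sketch of the DSU pattern (the key function here
stands for whatever you'd otherwise want to compare on):

    def dsu_sort(items, key):
        # Decorate: a new 2-tuple of (sort key, original index) per
        # element; the index breaks ties, which is what makes the result
        # stable and keeps the original items out of the comparisons.
        decorated = [(key(item), i) for i, item in enumerate(items)]
        # Sort the small tuples.
        decorated.sort()
        # Undecorate: pull the original elements back out in order.
        return [items[i] for k, i in decorated]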

After reading all those papers, I couldn't resist taking a crack at a new
algorithm that might be practical, and now have something you might call a
non-recursive adaptive stable natural mergesort / binary insertion sort
hybrid.  In playing with it so far, it has two bad aspects compared to our
samplesort hybrid:

+ It may require temp memory, up to 2*N bytes worst case (one pointer each
  for no more than half the array elements).

+ It gets *some* benefit for arrays with many equal elements, but not
  nearly as much as I was able to hack samplesort to get.  In effect,
  partitioning is very good at moving equal elements close to each other
  quickly, but merging leaves them spread across any number of runs.
  This is especially irksome because we're sticking to Py_LT for
  comparisons, so we can't even detect a==b without comparing a and b
  twice (it has to be deduced from "not a < b and not b < a"; see the
  sketch right after this list).  Given the relatively huge cost of
  comparisons, it's a timing disaster to do that (compare twice) unless
  it falls out naturally.  It was fairly natural to do so in samplesort,
  but not at all in this sort.
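
Spelled out at the Python level (the real code works with Py_LT in C),
the only way to learn a==b from a "less than" test alone is:

    def equal_via_lt(a, b):
        # two comparisons just to detect equality
        return not (a < b) and not (b < a)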

It also has good aspects:

+ It's stable (items that compare equal retain their relative order, so,
  e.g., if you sort first on zip code, and a second time on name, people
  with the same name still appear in order of increasing zip code; this
  is important in apps that, e.g., refine the results of queries based
  on user input).  A tiny example follows this list.

+ The code is much simpler than samplesort's (but I think I can
  fix that <wink>).

+ It gets benefit out of more kinds of patterns, and without lumpy
  special-casing (a natural mergesort has to identify ascending and
  descending runs regardless, and then the algorithm builds on just
  that).  A rough sketch of run detection also follows this list.

+ Even though I haven't micro-optimized it, in the random case it's
  almost as fast as the samplesort hybrid.  In fact, it might have
  looked a bit faster had I run these tests yesterday (the samplesort
  hybrid got sped up by 1-2% last night).  This one surprised me the
  most, because at the time I wrote the samplesort hybrid, I tried
  several ways of coding mergesorts and couldn't make any of them as
  fast.

+ It has no bad cases (O(N log N) is worst case; N-1 compares is best).
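
To make the stability point concrete, here's a toy version of the
zip-code-then-name example (a sketch assuming a key= argument on sort;
the same effect can be had with the DSU spelling above):

    people = [("Smith", 12345), ("Jones", 11111), ("Smith", 10000)]

    people.sort(key=lambda rec: rec[1])   # first by zip code
    people.sort(key=lambda rec: rec[0])   # then by name
    print(people)
    # [('Jones', 11111), ('Smith', 10000), ('Smith', 12345)]
    # the two Smiths kept their increasing-zip-code order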

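And the run detection mentioned in the "more kinds of patterns" bullet is
nothing exotic; roughly, at the Python level (the real thing is C, and
keeps descending runs strictly decreasing so reversing them can't break
stability):

    def count_run(a, lo):
        # Length of the natural run starting at a[lo], plus whether it
        # is descending, using only "<" comparisons.
        hi = lo + 1
        if hi == len(a):
            return 1, False
        if a[hi] < a[lo]:
            # strictly descending run; the caller reverses it in place
            while hi + 1 < len(a) and a[hi + 1] < a[hi]:
                hi += 1
            return hi - lo + 1, True
        else:
            # non-descending run
            while hi + 1 < len(a) and not (a[hi + 1] < a[hi]):
                hi += 1
            return hi - lo + 1, False
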
Here are some typical timings, taken from Python's sortperf.py, over
identical lists of floats:

Key:
    *sort: random data
    \sort: descending data
    /sort: ascending data
    3sort: ascending data but with 3 random exchanges
    ~sort: many duplicates
    =sort: all equal
    !sort: worst case scenario

That last one was a worst case for the quicksort Python used before it
grew the samplesort, and it's still a very bad case for the samplesort
hybrid.  By sheer coincidence, it turns out to be an exceptionally good
case for the experimental sort:

samplesort
 i    2**i  *sort  \sort  /sort  3sort  ~sort  =sort  !sort
15   32768   0.13   0.01   0.01   0.10   0.04   0.01   0.11
16   65536   0.24   0.02   0.02   0.23   0.08   0.02   0.24
17  131072   0.54   0.05   0.04   0.49   0.18   0.04   0.53
18  262144   1.18   0.09   0.09   1.08   0.37   0.09   1.16
19  524288   2.58   0.19   0.18   2.34   0.76   0.17   2.52
20 1048576   5.58   0.37   0.36   5.12   1.54   0.35   5.46

timsort
 i    2**i  *sort  \sort  /sort  3sort  ~sort  =sort  !sort
15   32768   0.16   0.01   0.02   0.05   0.14   0.01   0.02
16   65536   0.24   0.02   0.02   0.06   0.19   0.02   0.04
17  131072   0.55   0.04   0.04   0.13   0.42   0.04   0.09
18  262144   1.19   0.09   0.09   0.25   0.91   0.09   0.18
19  524288   2.60   0.18   0.18   0.46   1.97   0.18   0.37
20 1048576   5.61   0.37   0.35   1.00   4.26   0.35   0.74

If it weren't for the ~sort column, I'd seriously suggest replacing the
samplesort with this.  2*N extra bytes isn't as bad as it might sound, given
that, in the absence of massive object duplication, each list element
consumes at least 12 bytes (type pointer, refcount and value) + 4 bytes for
the list pointer.  Add 'em all up and that's a 13% worst-case temp memory
overhead.
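
(Spelling out that arithmetic: 2*N temp bytes on top of the
(12 + 4)*N = 16*N bytes the list already consumes is 2/16 = 12.5%,
rounded up to 13%.)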