[Python-Dev] Drastically improving list.sort() for lists of strings/ints

Tim Peters tim.peters at gmail.com
Sun Sep 11 17:42:22 EDT 2016


[redirected from python-dev, to python-ideas;
 please send followups only to python-ideas]

[Elliot Gorokhovsky <elgo8537 at colorado.edu>]
> ...
> TL;DR: Should I spend time making list.sort() detect if it is sorting
> lexicographically and, if so, use a much faster algorithm?

It will be fun to find out ;-)

As Mark, and especially Terry, pointed out, a major feature of the
current sort is that it can exploit many kinds of pre-existing order.
As the paper you referenced says, "Realistic sorting problems are
usually far from random."  But, although they did run some tests
against data with significant order, they didn't test against any
algorithms _aiming_ at exploiting uniformity.  Just against their
radix sort variants, and against a quicksort.

That's where it's likely next to impossible to guess in advance
whether radix sort _will_ have a real advantage.  All the kinds of
order the current sort can exploit are far from obvious, because the
mechanisms it employs are low-level & very general.  For example,
consider arrays created by this function, for even `n`:

    def bn(n):
        r = [None] * n
        r[0::2] = range(0, n//2)
        r[1::2] = range(n//2, n)
        return r

Then, e.g.,

>>> bn(16)
[0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]

This effectively takes range(n), cuts it in half, and does a "perfect
shuffle" on the two halves.

You'll find nothing in the current code looking for this as a special
case, but it nevertheless sorts such arrays in "close to" O(n) time,
and despite that there's no natural run in the input longer than 2
elements.

That said, I'd encourage you to write your code as a new list method
at first, to make it easiest to run benchmarks.  If that proves
promising, then you can worry about how to make a single method
auto-decide which algorithm to use.

Also use the two-array version.  It's easier to understand and to
code, and stability is crucial now.  The extra memory burden doesn't
bother me - an array of C pointers consumes little memory compared to
the memory consumed by the Python objects they point at.

Most of all, have fun! :-)


More information about the Python-Dev mailing list