[Python-Dev] Drastically improving list.sort() for lists of strings/ints

Elliot Gorokhovsky elliot.gorokhovsky at gmail.com
Sun Sep 11 20:01:37 EDT 2016


Wow, Tim himself! Regarding performance on semi-ordered data: we'll have to
benchmark to see, but intuitively I imagine radix sort would match Timsort
there, because merely verifying that a list of strings is sorted takes
Omega(nw) time, where w is the word length (which gives a lower bound on
Timsort), and radix sort is Theta(nw). So at least asymptotically it checks
out.

I also think that with the two-array algorithm, other kinds of partial
order can be exploited, since items are placed into their buckets in the
order in which they appear in the list (see the sketch below). So for the
example you gave, one pass would sort it correctly: the list has the
property that if x1 and x2 land in the same bucket, then x1 comes before x2
in the list, so x1 will also come before x2 in the bucket (except possibly
for one "border bucket" that straddles n/2). After that it's just
Theta(nw/b) per bucket to verify each bucket is sorted.

Honestly, the cool thing about radix is that the best case for Timsort on
strings, Omega(nw), is the worst case for radix! So the point is that the
two-array version, at least, preserves a lot of structure. Anyway, I hope
to have benchmarks (relatively) soon! (I'm a senior in high school, so I'm
pretty busy... but I'll get on this as soon as I can.)
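
To make the order-preservation point concrete, here is a minimal sketch of
one two-array bucket pass keyed on the first byte (256 fixed buckets, bytes
keys; the names and simplifications are mine, not the real algorithm):

    def bucket_pass(keys):
        # One stable pass of a two-array radix sort, bucketing bytes
        # keys on their first byte. Items are appended to each bucket
        # in list order, so pre-existing order inside a bucket survives.
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[k[0] if k else 0].append(k)  # b"" shares bucket 0
        out = []
        for b in buckets:
            out.extend(b)  # concatenate buckets back into one array
        return out

An input whose elements are already in order within each first-byte class
comes out fully sorted after a single pass:

    >>> bucket_pass([b"ba", b"ab", b"bb", b"ac"])
    [b'ab', b'ac', b'ba', b'bb']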


On Sun, Sep 11, 2016 at 3:42 PM Tim Peters <tim.peters at gmail.com> wrote:

> [redirected from python-dev, to python-ideas;
>  please send followups only to python-ideas]
>
> [Elliot Gorokhovsky <elgo8537 at colorado.edu>]
> > ...
> > TL;DR: Should I spend time making list.sort() detect if it is sorting
> > lexicographically and, if so, use a much faster algorithm?
>
> It will be fun to find out ;-)
>
> As Mark, and especially Terry, pointed out, a major feature of the
> current sort is that it can exploit many kinds of pre-existing order.
> As the paper you referenced says, "Realistic sorting problems are
> usually far from random."  But, although they did run some tests
> against data with significant order, they didn't test against any
> algorithms _aiming_ at exploiting uniformity.  Just against their
> radix sort variants, and against a quicksort.
>
> That's where it's likely next to impossible to guess in advance
> whether radix sort _will_ have a real advantage.  All the kinds of
> order the current sort can exploit are far from obvious, because the
> mechanisms it employs are low-level & very general.  For example,
> consider arrays created by this function, for even `n`:
>
>     def bn(n):
>         r = [None] * n
>         r[0::2] = range(0, n//2)
>         r[1::2] = range(n//2, n)
>         return r
>
> Then, e.g.,
>
> >>> bn(16)
> [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15]
>
> This effectively takes range(n), cuts it in half, and does a "perfect
> shuffle" on the two halves.
>
> You'll find nothing in the current code looking for this as a special
> case, but it nevertheless sorts such arrays in "close to" O(n) time,
> and despite that there's no natural run in the input longer than 2
> elements.
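>
> One quick way to see it is to count comparisons (a rough sketch;
> the wrapper exists only for counting):
>
>     import functools
>
>     def count_comparisons(arr):
>         # Sort a copy of arr and report how many element
>         # comparisons the sort performed.
>         n = 0
>         def cmp(a, b):
>             nonlocal n
>             n += 1
>             return (a > b) - (a < b)
>         sorted(arr, key=functools.cmp_to_key(cmp))
>         return n
>
>     for size in (2**10, 2**14, 2**18):
>         print(size, count_comparisons(bn(size)))
>
> The comparison count grows close to linearly in n, nowhere near
> the n log n a general sort needs on random data.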
>
> That said, I'd encourage you to write your code as a new list method
> at first, to make it easiest to run benchmarks.  If that proves
> promising, then you can worry about how to make a single method
> auto-decide which algorithm to use.
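>
> For the auto-decision itself, even a simple homogeneity scan could
> work (purely a sketch, not a design):
>
>     def all_strings(lst):
>         # True iff every element is exactly a str, so a
>         # lexicographic radix sort could give the same order
>         # as the default compare-based sort.
>         return all(type(x) is str for x in lst)
>
> One O(n) pass like that is cheap next to the sort that follows.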
>
> Also use the two-array version.  It's easier to understand and to
> code, and stability is crucial now.  The extra memory burden doesn't
> bother me - an array of C pointers consumes little memory compared to
> the memory consumed by the Python objects they point at.
>
> Most of all, have fun! :-)
>