[Python-ideas] [Python-Dev] Drastically improving list.sort() for lists of strings/ints

Mon Sep 12 23:18:03 EDT 2016

[Elliot Gorokhovsky <elliot.gorokhovsky at gmail.com>]
> Wow, Tim himself!

And Elliot himself!  It's a party :-)

> Regarding performance on semi-ordered data: we'll have to
> benchmark to see, but intuitively I imagine radix would meet Timsort
> because verifying that a list of strings is sorted takes Omega(nw)
> (which gives a lower bound on Timsort), where w is the word length.

Don't take that stuff so seriously.  The number of character
comparisons needed when comparing two strings is _typically_ small.
Consecutive corresponding characters are compared only until the first
non-equal pair is found.  For example, if the alphabet is 256
characters, for random strings it only requires one character
comparison 255 of 256 times on average (because there's only 1 chance
in 256 that the first characters _are_ equal).

> Radix sort is Theta(nw). So at least asymptotically it checks out.

Analogously, most-significant-byte-first radix sort stops on the pass
after the longest common prefix is processed.  But the real devils are
in the details.  More on that below.

> I think if one uses the two-array algorithm, other semi-sortings can also
> be exploited, since the items get placed into their respective buckets
> in the order in which they appear in the list. So, for the example you gave,
> one pass would sort it correctly

Can't know that unless you specify which algorithm you have in mind.
If you're talking about the one in the paper, _any_ sequence with
values all from range(256) will be sorted in the first pass, because
each value would get its own bucket.

> (since the list has the property if x_1 and x_2 are in bucket b, x1 comes
> before x2 in the list, so x1 will also come before x2 in the bucket. Except
> possibly for one "border bucket" that includes n/2). And then
> it would just be Theta(nw/b) in each bucket to verify sorted.

I don't think you want to go there.  If you're keen on radix sorting,
stick to radix sorting until it's as good as you can make it.  "To
verify sorted" requires the (potentially) arbitrarily expensive kinds
of comparisons you're trying to _eliminate_.  That's not saying
there's no possible advantage from doing some of each; it's warning
that pursuing every off-target idea that comes up will leave you
spinning in circles.

> I mean honestly the cool thing about radix is that the best case for
> Timsort on strings, Omega(nw), is the worst case for radix!

"Details":  I wrote a prototype radix sort along the lines of the
paper's 2-array version, in Python.  When there's "a real" O()
improvement to be had, one can usually demonstrate it by a Python
program incorporating the improvement even though it's fighting
against (unimproved) C code.  For example, I helped someone here craft
a Python radix sort for integers that runs faster than list.sort(),
although it required a list over 10,000,000 elements and a radix of at
least 4096 ;-)

    http://stackoverflow.com/questions/20207791/pushing-radix-sort-and-python-to-its-limits

However, I'm not having that kind of luck with strings.  Here's a
detail that bites:  as above, string comparisons _usually_ get out
fast, after comparing just a few characters.  So to construct a bad
case for list.sort() in this respect, I want lots of strings with a
long common prefix.  For example, suppose we have a list of a million
strings each of which starts with "A"*20000 (20 thousand "A").  String
comparisons are then very expensive (each one requires at least 20001
character comparisons to resolve).

But that also makes the radix sort excruciatingly slow!  On the first
pass, it copies the million elements, puts them all in the same
bucket, then copies them all back to exactly where they started from.
Pass 2, exactly the same, but looking at the 2nd occurrence of "A" in
each string.  Ditto for pass 3, pass 4, and ... pass 20,000.  By then
we've copied a million elements 40,000 times - but still have exactly
the same array we started with.  The data-copying costs are killing it
(yes, it's only copying pointers, but 40 billion pointer copies aren't
cheap  - and on my 64-bit box, amounts to copying 320 billion bytes).

On the same data, list.sort() is doing a great many character
comparisons, but doing _far_ less pointer copying, and that leaves it
running circles around the radix sort - list.sort() can't copy
pointers more than about N*log2(N) times, and log2(1000000) is much
smaller than 40,000.

To be complete, I'll attach the code I used.  Note that the paper is
only concerned with "C strings":  0 bytes are special, acting as
string terminators.  0 bytes aren't special in Python, so we need 257
buckets.  I'm also using Python 3, where strings are always Unicode,
and code points (`ord()` results) can be as large as 1,114,111.  So
this radix sort can't work directly on Python 3 strings; instead it
expects a list of `bytes` (or `bytearray`) objects.

    def rsort(xs):
        count = [0] * 257
        bindex = [None] * 257
        stack = [(0, len(xs), 0)] # start index, length, byte index
        push = stack.append
        pop = stack.pop
        while stack:
            assert all(c == 0 for c in count)
            si, n, bi = pop()
            txs = xs[si: si + n]
            #if n < 30:
            #    xs[si: si + n] = sorted(txs)
            #    continue
            minbyte, maxbyte = 255, 0
            for x in txs:
                b = x[bi] if bi < len(x) else -1
                count[b] += 1
                if count[b] == 1 and b >= 0:
                    if b < minbyte:
                        minbyte = b
                    if b > maxbyte:
                        maxbyte = b
            bindex[-1] = si
            sofar = si + count[-1]
            count[-1] = 0
            for b in range(minbyte, maxbyte + 1):
                c = count[b]
                if c == 0:
                    continue
                if c > 1:
                    push((sofar, c, bi + 1))
                bindex[b] = sofar
                sofar += c
                count[b] = 0
            assert sofar == si + n
            for x in txs:
                b = x[bi] if bi < len(x) else -1
                xs[bindex[b]] = x
                bindex[b] += 1

A little utility to generate "random" input may also be helpful:

    def build(n, m):
        from random import randrange
        xs = []
        for _ in range(n):
            xs.append(bytes(randrange(256) for _ in range(randrange(m))))
        return xs

Then, e.g.,

>>> xs = build(1000000, 20)
>>> ys = xs[:]
>>> rsort(xs)
>>> assert xs == sorted(ys)