[Elliot Gorokhovsky email@example.com]
Wow, Tim himself!
And Elliot himself! It's a party :-)
Regarding performance on semi-ordered data: we'll have to benchmark to see, but intuitively I imagine radix would meet Timsort because verifying that a list of strings is sorted takes Omega(nw) (which gives a lower bound on Timsort), where w is the word length.
Don't take that stuff so seriously. The number of character comparisons needed when comparing two strings is _typically_ small. Consecutive corresponding characters are compared only until the first non-equal pair is found. For example, if the alphabet is 256 characters, for random strings it only requires one character comparison 255 of 256 times on average (because there's only 1 chance in 256 that the first characters _are_ equal).
Radix sort is Theta(nw). So at least asymptotically it checks out.
Analogously, most-significant-byte-first radix sort stops on the pass after the longest common prefix is processed. But the real devils are in the details. More on that below.
I think if one uses the two-array algorithm, other semi-sortings can also be exploited, since the items get placed into their respective buckets in the order in which they appear in the list. So, for the example you gave, one pass would sort it correctly
Can't know that unless you specify which algorithm you have in mind. If you're talking about the one in the paper, _any_ sequence with values all from range(256) will be sorted in the first pass, because each value would get its own bucket.
(since the list has the property if x_1 and x_2 are in bucket b, x1 comes before x2 in the list, so x1 will also come before x2 in the bucket. Except possibly for one "border bucket" that includes n/2). And then it would just be Theta(nw/b) in each bucket to verify sorted.
I don't think you want to go there. If you're keen on radix sorting, stick to radix sorting until it's as good as you can make it. "To verify sorted" requires the (potentially) arbitrarily expensive kinds of comparisons you're trying to _eliminate_. That's not saying there's no possible advantage from doing some of each; it's warning that pursuing every off-target idea that comes up will leave you spinning in circles.
I mean honestly the cool thing about radix is that the best case for Timsort on strings, Omega(nw), is the worst case for radix!
"Details": I wrote a prototype radix sort along the lines of the paper's 2-array version, in Python. When there's "a real" O() improvement to be had, one can usually demonstrate it by a Python program incorporating the improvement even though it's fighting against (unimproved) C code. For example, I helped someone here craft a Python radix sort for integers that runs faster than list.sort(), although it required a list over 10,000,000 elements and a radix of at least 4096 ;-)
However, I'm not having that kind of luck with strings. Here's a detail that bites: as above, string comparisons _usually_ get out fast, after comparing just a few characters. So to construct a bad case for list.sort() in this respect, I want lots of strings with a long common prefix. For example, suppose we have a list of a million strings each of which starts with "A"*20000 (20 thousand "A"). String comparisons are then very expensive (each one requires at least 20001 character comparisons to resolve).
But that also makes the radix sort excruciatingly slow! On the first pass, it copies the million elements, puts them all in the same bucket, then copies them all back to exactly where they started from. Pass 2, exactly the same, but looking at the 2nd occurrence of "A" in each string. Ditto for pass 3, pass 4, and ... pass 20,000. By then we've copied a million elements 40,000 times - but still have exactly the same array we started with. The data-copying costs are killing it (yes, it's only copying pointers, but 40 billion pointer copies aren't cheap - and on my 64-bit box, amounts to copying 320 billion bytes).
On the same data, list.sort() is doing a great many character comparisons, but doing _far_ less pointer copying, and that leaves it running circles around the radix sort - list.sort() can't copy pointers more than about N*log2(N) times, and log2(1000000) is much smaller than 40,000.
To be complete, I'll attach the code I used. Note that the paper is only concerned with "C strings": 0 bytes are special, acting as string terminators. 0 bytes aren't special in Python, so we need 257 buckets. I'm also using Python 3, where strings are always Unicode, and code points (`ord()` results) can be as large as 1,114,111. So this radix sort can't work directly on Python 3 strings; instead it expects a list of `bytes` (or `bytearray`) objects.
def rsort(xs): count =  * 257 bindex = [None] * 257 stack = [(0, len(xs), 0)] # start index, length, byte index push = stack.append pop = stack.pop while stack: assert all(c == 0 for c in count) si, n, bi = pop() txs = xs[si: si + n] #if n < 30: # xs[si: si + n] = sorted(txs) # continue minbyte, maxbyte = 255, 0 for x in txs: b = x[bi] if bi < len(x) else -1 count[b] += 1 if count[b] == 1 and b >= 0: if b < minbyte: minbyte = b if b > maxbyte: maxbyte = b bindex[-1] = si sofar = si + count[-1] count[-1] = 0 for b in range(minbyte, maxbyte + 1): c = count[b] if c == 0: continue if c > 1: push((sofar, c, bi + 1)) bindex[b] = sofar sofar += c count[b] = 0 assert sofar == si + n for x in txs: b = x[bi] if bi < len(x) else -1 xs[bindex[b]] = x bindex[b] += 1
A little utility to generate "random" input may also be helpful:
def build(n, m): from random import randrange xs =  for _ in range(n): xs.append(bytes(randrange(256) for _ in range(randrange(m)))) return xs
xs = build(1000000, 20) ys = xs[:] rsort(xs) assert xs == sorted(ys)