So I got excited here. And the reason why is that I got those numbers *on Tim's benchmark*. When I got these kinds of numbers on my own benchmarks, I figured there was probably a problem with the way I was timing, and that the gains surely couldn't be as extreme as they suggested. But this is on a benchmark that's already in the codebase!

Here is a detailed explanation of how to reproduce my results, and the circumstances that would lead them to be invalid:

******************************************

To reproduce, just activate a virtualenv and clone https://github.com/embg/python-fast-listsort.git. Then run "python setup.py install" followed by "python sortperf.py".

Now let's look at what sortperf.py does and how it relates to Tim's benchmark at Lib/test/sortperf.py. If you diff the two, you'll find I made three changes:

1. I added an import, "import fastlist". This obviously would not make sorting twice faster.

2. I changed the way it formats the output: I changed "fmt = ("%2s %7s" + " %7s"*len(cases))" to "fmt = ("%2s %7s" + " %6s"*len(cases))". Again irrelevant.

3. I changed the timing function:

#from this

def doit_fast(L):
    t0 = time.perf_counter()
    L.fastsort()
    t1 = time.perf_counter()
    print("%6.2f" % (t1-t0), end=' ')
    flush()

#to this

def doit(L):
    F = FastList(L)
    f0 = time.perf_counter()
    F.fastsort()
    f1 = time.perf_counter()
    F = FastList(L)
    t0 = time.perf_counter()
    F.sort()
    t1 = time.perf_counter()
    print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
    flush()
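For clarity, the percentage doit() prints is the relative time saved by fastsort over sort. Here's a minimal standalone sketch of just that formula (the helper name is mine, not from the patch):

```python
def relative_speedup(fast_elapsed, baseline_elapsed):
    """Same formula doit() prints: 100 * (1 - fast/baseline).

    A result of 50.0 means the fast path took half as long as the baseline.
    """
    return 100 * (1 - fast_elapsed / baseline_elapsed)

# e.g. 0.5s for fastsort vs. 1.0s for sort -> half the time saved
print("%6.2f%%" % relative_speedup(0.5, 1.0))
```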

*******************************************

So what we've shown is that if (1) you trust the existing sorting benchmark and (2) my modification to doit() doesn't mess anything up (I leave that for you to judge), then the measurements are valid. Which is a pretty big deal (50%!), hence my overexcitement.

****************************************

Now I'd like to respond to some responses I've gotten (the one I'm thinking of was off-list, so I don't want to quote it) questioning how such a small optimization, bypassing the type checks, could possibly have such a large effect, even in theory. Here's my answer:

Let's ignore branch prediction and caching for now and just look at this at a high level. The cost of sorting is dominated by the cost of a single comparison, because the vast majority of our time (let's say certainly at least 90%, depending on the list) is spent in comparisons. So let's look at the cost of a comparison.

Without my optimization, comparisons for floats (that's what this benchmark looks at) go roughly like this:

1. Test type of left and right for PyObject_RichCompare (which costs two pointer dereferences) and compare them. "3 ops" (quotes because counting ops like this is pretty hand-wavy). "2 memory accesses".

2. Get the address of the float compare method from PyFloat_Type->tp_richcompare. "1 op". "1 memory access".

3. Call the function whose address we just got. "1 op". "Basically 0 memory accesses because we count the stack stuff in that 1 op".

4. Test type of left and right again in PyFloat_RichCompare and compare both of them to PyFloat_Type. "4 ops". "2 memory accesses".

5. Get floats from the PyObject* by calling PyFloat_AS_DOUBLE or whatever. "2 ops". "2 memory accesses".

6. Compare the floats and return. "2 ops".

Now let's tally the "cost" (sorry for use of quotes here, just trying to emphasize this is an intuitive, theoretical explanation for the numbers which doesn't take into account the hardware):
"13 ops, 7 memory accesses".

Here's what it looks like in my code:

1. Call PyFloat_AS_DOUBLE on left and right. "2 ops". "2 memory accesses".

2. Compare the floats and return. "2 ops".

Tally: "4 ops, 2 memory accesses".
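To make the two tallies concrete, here's a hypothetical Python model of the two paths. The real code is C inside CPython; the op counts here are just the hand-wavy tallies from above, baked in as counters:

```python
class Ops:
    """Accumulates 'ops' in the same hand-wavy sense as the tallies above."""
    def __init__(self):
        self.count = 0

def generic_lt(ops, left, right):
    """Model of the generic path: PyObject_RichCompare -> PyFloat_RichCompare."""
    # Step 1: fetch both types and compare them ("3 ops", "2 memory accesses")
    ops.count += 3
    if type(left) is not type(right):
        raise TypeError("this model only handles same-type operands")
    # Step 2: look up the type's rich-compare slot ("1 op", "1 memory access")
    ops.count += 1
    cmp_func = type(left).__lt__
    # Step 3: call through the function pointer ("1 op")
    ops.count += 1
    # Steps 4-6: PyFloat_RichCompare re-checks both types, unboxes the
    # doubles, and compares them ("4 + 2 + 2 = 8 ops", "4 memory accesses")
    ops.count += 8
    return cmp_func(left, right)

def specialized_lt(ops, left, right):
    """Model of the optimized path: unbox both floats and compare directly."""
    # Steps 1-2: PyFloat_AS_DOUBLE on both sides, then compare ("4 ops")
    ops.count += 4
    return left < right
```

Both paths return the same answer; the generic one just spends 13 "ops" getting there instead of 4.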

Now you can argue that branch prediction alleviates a lot of this cost, since we're taking the same branches every time. But note that, branch prediction or not, we still have to do all of those memory accesses, and since they're pointers to places all over memory, they miss the cache basically every time (correct me if I'm wrong). So memory-wise, we really are doing something like a 7:2 ratio, and op-wise, perhaps not as bad because of branch prediction, but still, 13:4 is probably bad no matter what's going on in the hardware.

Now consider that something like 90% of our time is spent in those steps. Are my numbers really that unbelievable?
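As a back-of-the-envelope sanity check (my arithmetic, using the hand-wavy op counts and the assumed 90% comparison share from above): if comparisons get 13:4 cheaper and everything else stays the same, the overall runtime drops to 0.1 + 0.9 * (4/13), which is in the same ballpark as the measured 50%:

```python
comparison_fraction = 0.90   # assumed share of runtime spent in comparisons
op_ratio = 4 / 13            # specialized vs. generic "op" counts from above

# Amdahl-style estimate: non-comparison time is untouched,
# comparison time shrinks by the op ratio.
new_time = (1 - comparison_fraction) + comparison_fraction * op_ratio
speedup_percent = 100 * (1 - new_time)
print("estimated time saved: %.1f%%" % speedup_percent)
```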

Thanks for everything. Looking forward to writing this up as a nice LaTeX doc with graphs and perf benchmarks and all the other rigorous goodies, as well as a special-case cmp func for homogeneous tuples and a simple patch file,

Elliot