To answer your question: I special-case unicode (strings), ints, and floats. I am working on special-casing tuples (the elements can even be of different types; I just need column-wise homogeneity). The best speedups will be for tuples of floats: that bypasses three layers of useless checks.
If I run it without special-casing floats (just calling through tp_richcompare) I only get something like a 5-10% speedup. I'm working on rigorous benchmarks for all of this and will post a PDF along with the patch once it's done, but the gain is certainly under 10%. Part of this is because my float special case is really low-level; for strings I've actually found the opposite: calling through tp_richcompare gives me almost the same results as my special-case compare, since the special case still has to PyUnicode_READY the strings (or whatever it's called).
Regarding generalization: the general technique for special-casing is to substitute every type check with 1 or 0 according to the type assumption you're making. That's the only way to guarantee the result is safe and compliant.
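To make that concrete, here's a rough Python-level sketch of the idea (the helper names is_homogeneous, generic_lt, and float_lt are mine for illustration, not from the patch): one O(n) homogeneity scan up front means every per-comparison type check becomes a known constant and can be dropped.

```python
def is_homogeneous(lst):
    """Return the common type if all elements share one exact type, else None."""
    if not lst:
        return None
    t = type(lst[0])
    return t if all(type(x) is t for x in lst) else None

def generic_lt(a, b):
    # Stand-in for the generic path: re-checks types on every call.
    if type(a) is float and type(b) is float:
        return a < b
    raise TypeError("unsupported comparison")

def float_lt(a, b):
    # Specialized path: the type checks above are known to hold for the
    # whole list, so they are substituted with 1 (true) and deleted.
    return a < b

data = [3.0, 1.0, 2.0]
lt = float_lt if is_homogeneous(data) is float else generic_lt
```

The real patch does this at the C level inside list.sort, but the shape of the argument is the same: the pre-scan pays O(n) once so that O(n log n) comparisons each get cheaper.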
On Tue, Oct 11, 2016, 5:19 PM Jim J. Jewett firstname.lastname@example.org wrote:
Excellent. I'm surprised cache didn't save more, but less surprised than I was ... I hadn't realized that you were skipping the verifications in PyFloat_RichCompare as well. Does that generalize to other element types without exposing too much of the per-type internals to list.sort?
Oh ... and I appreciate your not quoting private email as a general courtesy, but I hereby give you permission if it was mine that was private. [Though I think your summary was better than a quote anyhow.]
On Oct 11, 2016 4:58 PM, "Elliot Gorokhovsky" <email@example.com> wrote:
So I got excited here. And the reason why is that I got those numbers *on Tim's benchmark*. When I got these kinds of numbers on my own benchmarks, I figured there was probably a problem with the way I was timing, and that the gains certainly couldn't be as extreme as they suggested. But this is on a benchmark that's already in the codebase!
Here is a detailed explanation of how to reproduce my results, and of the circumstances under which they would be invalid:
To reproduce, just activate a virtualenv and then clone https://github.com/embg/python-fast-listsort.git. Then run "python setup.py install" and "python sortperf.py".
Now let's look at what sortperf.py does and how it relates to Tim's benchmark at Lib/test/sortperf.py. If you diff the two, you'll find I made three changes:
- I added an import, "import fastlist". This obviously would not make sorting twice as fast.
- I changed the way it formats the output: "fmt = ("%2s %7s" + " %7s"*len(cases))" became "fmt = ("%2s %7s" + " %6s"*len(cases))". Again, irrelevant.
- I changed the timing functions:

      def doit_fast(L):
          t0 = time.perf_counter()
          L.fastsort()
          t1 = time.perf_counter()
          print("%6.2f" % (t1-t0), end=' ')
          flush()

      def doit(L):
          F = FastList(L)
          f0 = time.perf_counter()
          F.fastsort()
          f1 = time.perf_counter()
          F = FastList(L)
          t0 = time.perf_counter()
          F.sort()
          t1 = time.perf_counter()
          print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
          flush()
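For clarity, the number doit() prints is the percent of the default sort's wall time saved by fastsort(). A tiny runnable sketch of just that formula (the name speedup_percent is mine):

```python
def speedup_percent(fast_time, default_time):
    # The quantity doit() formats with "%6.2f%%": percent of the
    # default sort's time saved by the fast sort.
    return 100 * (1 - fast_time / default_time)

# e.g. 0.5 s vs 1.0 s is a 50% reduction
```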
So what we've shown is that (1) if you trust the existing sorting benchmark and (2) if my modification to doit() doesn't mess anything up (I leave that for you to judge), then the measurements are valid. Which is a pretty big deal (50%!), hence my overexcitement.
Now I'd like to respond to some questions I've gotten (the one I'm thinking of was off-list, so I won't quote it) about how such a small optimization, bypassing the type checks, could possibly have such a large effect, even in theory. Here's my answer:
Let's ignore branch prediction and caching for now and just look at it at a high level. The cost of sorting is dominated by the cost of a single comparison, because the vast majority of our time (certainly at least 90%, depending on the list) is spent in comparisons. So let's look at the cost of a comparison.
Without my optimization, comparisons for floats (that's what this benchmark looks at) go roughly like this:
- Test the types of left and right in PyObject_RichCompare (which costs two pointer dereferences) and compare them. "3 ops" (quotes because counting ops like this is pretty hand-wavy). "2 memory accesses".
- Get the address of the float compare method from PyFloat_Type->tp_richcompare. "1 op". "1 memory access".
- Call the function whose address we just got. "1 op". "Basically 0 memory accesses, because we count the stack work in that 1 op".
- Test the types of left and right again in PyFloat_RichCompare, comparing both of them to PyFloat_Type. "4 ops". "2 memory accesses".
- Get the doubles out of the PyObject* by calling PyFloat_AS_DOUBLE or whatever. "2 ops". "2 memory accesses".
- Compare the doubles and return. "2 ops".
Now let's tally the "cost" (sorry for the quotes; I'm just trying to emphasize that this is an intuitive, theoretical explanation of the numbers that doesn't take the hardware into account): "13 ops, 7 memory accesses".
Here's what it looks like in my code:
- Call PyFloat_AS_DOUBLE on left and right. "2 ops". "2 memory accesses".
- Compare the doubles and return. "2 ops".
Tally: "4 ops, 2 memory accesses".
Now you can argue that branch prediction alleviates a lot of this cost, since we take the same branches every time. But note that, branch prediction or not, we still have to do all of those memory accesses, and since they're pointers to places all over memory, they miss the cache basically every time (correct me if I'm wrong). So memory-wise we really are looking at something like a 7:2 ratio, and op-wise perhaps not quite 13:4 because of branch prediction, but still bad no matter what's going on in the hardware.
Now consider that something like 90% of our time is spent in those steps. Are my numbers really that unbelievable?
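A quick Amdahl-style check of my own (a back-of-envelope, assuming the "13 ops" vs "4 ops" ratio roughly carries over to time): if 90% of sort time is comparisons and each comparison gets 13/4 ≈ 3.25x faster, the whole sort speeds up by about 2.65x, i.e. a ~60% time reduction, the same ballpark as the observed ~50%.

```python
def overall_speedup(frac_compare, compare_speedup):
    # Amdahl's law: only the comparison fraction of the runtime gets faster.
    return 1 / ((1 - frac_compare) + frac_compare / compare_speedup)

print(round(overall_speedup(0.9, 13 / 4), 2))  # 2.65
```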
Thanks for everything. I'm looking forward to writing this up as a nice LaTeX doc with graphs, perf benchmarks, and all the other rigorous goodies, along with a special-case compare function for homogeneous tuples and a simple patch file.