[Python-ideas] INSANE FLOAT PERFORMANCE!!!

Elliot Gorokhovsky elliot.gorokhovsky at gmail.com
Tue Oct 11 16:58:07 EDT 2016


So I got excited here. And the reason why is that I got those numbers *on
Tim's benchmark*. When I got these kinds of numbers on my own benchmarks, I
figured there was probably a problem with the way I was timing, and that the
gains surely couldn't be as extreme as they suggested. But this is
on a benchmark that's already in the codebase!


Here is a detailed explanation of how to reproduce my results, and the
circumstances under which they would be invalid:

******************************************

To reproduce, just activate a virtualenv and clone
https://github.com/embg/python-fast-listsort.git. Then run "python setup.py
install" followed by "python sortperf.py".


Now let's look at what sortperf.py does and how it relates to Tim's
benchmark at Lib/test/sortperf.py. If you diff the two, you'll find I made
three changes:


1. I added an import, "import fastlist". This obviously would not make
sorting twice as fast.


2. I changed the way it formats the output: I changed "fmt = ("%2s %7s" + "
%7s"*len(cases))" to "fmt = ("%2s %7s" + " %6s"*len(cases))". Again
irrelevant.


3. I changed the timing function

#from this (the original doit() in Lib/test/sortperf.py)


def doit(L):
    t0 = time.perf_counter()
    L.sort()
    t1 = time.perf_counter()
    print("%6.2f" % (t1-t0), end=' ')
    flush()



#to this


def doit(L):
    F = FastList(L)
    f0 = time.perf_counter()
    F.fastsort()    # time the optimized sort
    f1 = time.perf_counter()
    F = FastList(L)    # fresh FastList so the second sort also starts unsorted
    t0 = time.perf_counter()
    F.sort()    # time the regular list.sort()
    t1 = time.perf_counter()
    # print fastsort's improvement as a percentage of list.sort()'s time
    print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
    flush()


*******************************************

So what we've shown is that (1) if you trust the existing sorting benchmark
and (2) if my modification to doit() doesn't mess anything up (I leave this
up to you to judge), then the measurements are valid. Which is a pretty
big deal (50% !!!!!!!; by the formula above, that means fastsort took half
as long as the regular sort), hence my overexcitement.

****************************************


Now I'd like to respond to the responses I've gotten (the one I'm thinking of
was off-list, so I won't quote it) questioning how such a small optimization,
just bypassing the type checks, could possibly have such a large effect, even
in theory. Here's my answer:

Let's ignore branch prediction and the cache for now and just look at this at
a high level. The cost of sorting is dominated by the cost of a single
comparison, because the vast majority of our time (certainly at least 90%,
depending on the list) is spent in comparisons. So let's look at the cost
of a comparison.

Without my optimization, comparisons for floats (that's what this benchmark
looks at) go roughly like this:

1. In PyObject_RichCompare, load the types of left and right (which costs two
pointer dereferences) and compare them. "3 ops" (quotes because counting
ops like this is pretty hand-wavy). "2 memory accesses".

2. Get the address of the float compare method from
PyFloat_Type->tp_richcompare. "1 op". "1 memory access".

3. Call the function whose address we just got. "1 op". "Basically 0 memory
accesses because we count the stack stuff in that 1 op".

4. Test type of left and right again in PyFloat_RichCompare and compare
both of them to PyFloat_Type. "4 ops". "2 memory accesses".

5. Get floats from the PyObject* by calling PyFloat_AS_DOUBLE or whatever.
"2 ops". "2 memory accesses".

6. Compare the floats and return. "2 ops".

Now let's tally the "cost" (sorry for use of quotes here, just trying to
emphasize this is an intuitive, theoretical explanation for the numbers
which doesn't take into account the hardware):
"13 ops, 7 memory accesses".

Here's what it looks like in my code:

1. Call PyFloat_AS_DOUBLE on left and right. "2 ops". "2 memory accesses".

2. Compare the floats and return. "2 ops".

Tally: "4 ops, 2 memory accesses".

Now you can argue branch prediction alleviates a lot of this cost, since
we're taking the same branches every time. But note that, branch prediction
or not, we still have to do all of those memory accesses, and since they're
pointers to places all over memory, they miss the cache basically every
time (correct me if I'm wrong). So memory-wise, we really are doing
something like a 7:2 ratio, and op-wise, perhaps not as bad because of
branch prediction, but still, 13:4 is probably bad no matter what's going
on in the hardware.

Now consider that something like 90% of our time is spent in those steps.
Are my numbers really that unbelievable?

Thanks for everything; I'm looking forward to writing this up as a nice LaTeX
doc with graphs and perf benchmarks and all the other rigorous goodies, as
well as a special-case compare function for homogeneous tuples and a simple
patch file,

Elliot