Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

11 Oct 2016

      To answer your question: I special-case unicode (strings), ints, and
floats. I am working on special-casing tuples (can even be different types,
just need homogeneity column-wise). The best speedups will be tuples of
floats: it'll bypass three layers of useless checks.

If I run it without special-casing floats (just using tp->rich_compare) I
only get like a 5-10% speedup. I'm working on rigorous benchmarks for all
this stuff, will post a pdf along with the patch once it's all done. But
it's certainly <10%. However, part of this is because my special case is
really low-level; for strings I've actually found the opposite, using
tp->richcompare gives me almost the same results as my special case
compare, since it still has to PyUnicode_READY the strings (or whatever
it's called).

Regarding generalization: the general technique for special-casing is you
just substitute all type checks with 1 or 0 by applying the type assumption
you're making. That's the only way to guarantee it's safe and compliant.

Elliot

On Tue, Oct 11, 2016, 5:19 PM Jim J. Jewett <jimjjewett@gmail.com> wrote:
...
Excellent.
I'm surprised cache didn't save more, but less surprised than I was ... I
hadn't realized that you were skipping the verifications in
PyFloat_RichCompare as well.  Does that generalize to other element types
without exposing too much of the per-type internals to list.sort?
Oh ... and I appreciate your not quoting private email as a general
courtesy, but I hereby give you permission if it was mine that was private.
[Though I think your summary was better than a quote anyhow.]
-jJ
On Oct 11, 2016 4:58 PM, "Elliot Gorokhovsky" <
elliot.gorokhovsky@gmail.com> wrote:
So I got excited here. And the reason why is that I got those numbers *on
Tim's benchmark*. When I got these kinds of numbers on my benchmarks, I
figured there was probably a problem with they way I was timing, and
certainly the gains couldn't be as extreme as they suggested. But this is
on a benchmark that's already in the codebase!
Here is a detailed explanation of how to reproduce my results, and the
circumstances that would lead them to be invalid:
******************************************
To reproduce, just activate a virtualenv, and then clone
https://github.com/embg/python-fast-listsort.git. Then python setup.py
install and python sortperf.py.
Now let's look at what sortperf.py does and how it relates to Tim's
benchmark at Lib/test/sortperf.py. If you diff the two, you'll find I made
three changes:
1. I added an import, "import fastlist". This obviously would not make
sorting twice faster.
2. I changed the way it formats the output: I changed "fmt = ("%2s %7s" +
" %7s"*len(cases))" to "fmt = ("%2s %7s" + " %6s"*len(cases))". Again
irrelevant.
3. I changed the timing function
#from this
def doit_fast(L):
    t0 = time.perf_counter()
    L.fastsort()
    t1 = time.perf_counter()
    print("%6.2f" % (t1-t0), end=' ')
    flush()
#to this
def doit(L):
    F = FastList(L)
    f0 = time.perf_counter()
    F.fastsort()
    f1 = time.perf_counter()
    F = FastList(L)
    t0 = time.perf_counter()
    F.sort()
    t1 = time.perf_counter()
    print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
    flush()
*******************************************
So what we've shown is that (1) if you trust the existing sorting
benchmark and (2) if my modification to doit() doesn't mess anything up (I
leave this up to you to judge), then the measurements are as valid. Which
is a pretty big deal (50% !!!!!!!), hence my overexcitement.
****************************************
Now I'd like to respond to responses (the one I'm thinking of was off-list
so I don't want to quote it) I've gotten questioning how it could be
possible for such a small optimization, bypassing the typechecks, to
possibly have such a large effect, even in theory. Here's my answer:
Let's ignore branch prediction and cache for now and just look at a high
level. The cost of sorting is related to the cost of a single comparison,
because the vast majority of our time (let's say certainly at least 90%,
depending on the list) is spent in comparisons. So let's look at the cost
of a comparison.
Without my optimization, comparisons for floats (that's what this
benchmark looks at) go roughly like this:
1. Test type of left and right for PyObject_RichCompare (which costs two
pointer dereferences) and compare them. "3 ops" (quotes because counting
ops like this is pretty hand-wavy). "2 memory accesses".
2. Get the address of the float compare method from
PyFloat_Type->tp_richcompare. "1 op". "1 memory access".
3. Call the function whose address we just got. "1 op". "Basically 0
memory accesses because we count the stack stuff in that 1 op".
4. Test type of left and right again in PyFloat_RichCompare and compare
both of them to PyFloat_Type. "4 ops". "2 memory accesses".
5. Get floats from the PyObject* by calling PyFloat_AS_DOUBLE or whatever.
"2 ops". "2 memory accesses".
6. Compare the floats and return. "2 ops".
Now let's tally the "cost" (sorry for use of quotes here, just trying to
emphasize this is an intuitive, theoretical explanation for the numbers
which doesn't take into account the hardware):
"13 ops, 7 memory accesses".
Here's what it looks like in my code:
1. Call PyFloat_AS_DOUBLE on left and right. "2 ops". "2 memory acceses".
2. Compare the floats and return. "2 ops".
Tally: "4 ops, 2 memory accesses".
Now you can argue branch prediction alleviates a lot of this cost, since
we're taking the same branches every time. But note that, branch prediction
or not, we still have to do all of those memory acceses, and since they're
pointers to places all over memory, they miss the cache basically every
time (correct me if I'm wrong). So memory-wise, we really are doing
something like a 7:2 ratio, and op-wise, perhaps not as bad because of
branch prediction, but still, 13:4 is probably bad no matter what's going
on in the hardware.
Now consider that something like 90% of our time is spent in those steps.
Are my numbers really that unbelievable?
Thanks for everything, looking forward to writing this up as a nice latex
doc with graphs and perf benchmarks and all the other rigorous goodies, as
well as a special case cmp func for homogeneous tuples and a simple patch
file,
Elliot