Re: [pypy-dev] Py_DecRef() in cpyext

26 Feb 2012

Stefan Behnel, 26.02.2012 09:50:
...
when I took a look at object.h and saw that the Py_DECREF() macro *always*
calls into it. Another surprise.
I had understood in previous discussions that the refcount emulation in
cpyext only counts C references, which I consider a suitable design. (I
guess something as common as Py_None uses the obvious optimisation of
always having a ref-count > 1, right? At least when not debugging...)
So I changed the macros to use an appropriate C-level implementation:
"""
#define Py_INCREF(ob)  ((((PyObject *)(ob))->ob_refcnt > 0) ? \
     (void)(((PyObject *)(ob))->ob_refcnt++) : Py_IncRef((PyObject *)(ob)))
#define Py_DECREF(ob)  ((((PyObject *)(ob))->ob_refcnt > 1) ? \
     (void)(((PyObject *)(ob))->ob_refcnt--) : Py_DecRef((PyObject *)(ob)))
#define Py_XINCREF(op) do { if ((op) == NULL) ; else Py_INCREF(op); \
                          } while (0)
#define Py_XDECREF(op) do { if ((op) == NULL) ; else Py_DECREF(op); \
                          } while (0)
"""
to tell the C compiler that it doesn't actually need to call into PyPy in
most cases (note that I didn't use any branch prediction macros, but that
shouldn't change all that much anyway). This shaved off a couple of cycles
from my iteration benchmark, but much less than I would have liked. My
intuition tells me that this is because almost all objects that appear in
the benchmark are actually short-lived in C space so that pretty much every
Py_DECREF() on them kills them straight away and thus calls into
Py_DecRef() anyway. To be verified with a better test.
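
To make that intuition concrete, here is a minimal sketch that counts how often the slow path is taken. It is not cpyext code: the PyObject struct, the counting Py_IncRef()/Py_DecRef() stand-ins and the two helper functions are hypothetical mocks, and the mock Py_DecRef() only decrements instead of freeing anything.

```c
#include <stddef.h>

/* Hypothetical stand-in for the cpyext object header. */
typedef struct { long ob_refcnt; } PyObject;

/* Counts the calls that would cross the border into PyPy. */
static long slow_calls = 0;

/* Mock slow-path functions: they only count and adjust the refcount. */
static void Py_IncRef(PyObject *ob) { slow_calls++; ob->ob_refcnt++; }
static void Py_DecRef(PyObject *ob) { slow_calls++; ob->ob_refcnt--; }

/* Same fast-path idea as the macros quoted above. */
#define Py_INCREF(ob) ((((PyObject *)(ob))->ob_refcnt > 0) ? \
    (void)(((PyObject *)(ob))->ob_refcnt++) : Py_IncRef((PyObject *)(ob)))
#define Py_DECREF(ob) ((((PyObject *)(ob))->ob_refcnt > 1) ? \
    (void)(((PyObject *)(ob))->ob_refcnt--) : Py_DecRef((PyObject *)(ob)))

/* n temporaries, each born with one C reference and dropped right away. */
static long slow_calls_short_lived(int n)
{
    slow_calls = 0;
    for (int i = 0; i < n; i++) {
        PyObject tmp = { 1 };   /* new reference, refcount 1 */
        Py_DECREF(&tmp);        /* 1 -> 0: has to take the slow path */
    }
    return slow_calls;
}

/* n INCREF/DECREF pairs on an object that already has one C reference. */
static long slow_calls_long_lived(int n)
{
    PyObject obj = { 1 };
    slow_calls = 0;
    for (int i = 0; i < n; i++) {
        Py_INCREF(&obj);        /* 1 -> 2: handled inline */
        Py_DECREF(&obj);        /* 2 -> 1: handled inline */
    }
    return slow_calls;
}
```

Under these assumptions, every short-lived temporary costs one call into Py_DecRef(), while the object that keeps a second C reference never leaves the inline path.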
Ok, here's a stupid micro-benchmark for ref-counting:

def bench(x):
    cdef int i
    for i in xrange(10000):
        a = x
        b = x
        c = x
        d = x
        e = x
        f = x
        g = x

Leads to the obvious C code. :) (and yes, this will eventually stop
actually being a benchmark in Cython...)
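
For readers who don't have the generated code at hand, a rough sketch of what the loop boils down to is given here. It assumes each "name = x" becomes an incref of x, an xdecref of the old value and a pointer store; the real Cython output uses its own helper macros and may differ. The PyObject struct, the counter, the use_fast_path switch and all function names are hypothetical mocks, and x is assumed to arrive with a single C reference.

```c
#include <stddef.h>

typedef struct { long ob_refcnt; } PyObject;   /* mock object header */

static long border_calls = 0;   /* calls that would cross into PyPy */
static int use_fast_path = 1;   /* 1: inline when possible, 0: always call */

/* Mock slow-path functions: they only count and adjust the refcount. */
static void Py_IncRef(PyObject *ob) { border_calls++; ob->ob_refcnt++; }
static void Py_DecRef(PyObject *ob) { border_calls++; ob->ob_refcnt--; }

static void incref(PyObject *ob)
{
    if (use_fast_path && ob->ob_refcnt > 0)
        ob->ob_refcnt++;        /* stays in C */
    else
        Py_IncRef(ob);
}

static void xdecref(PyObject *ob)
{
    if (ob == NULL)
        return;
    if (use_fast_path && ob->ob_refcnt > 1)
        ob->ob_refcnt--;        /* stays in C */
    else
        Py_DecRef(ob);
}

/* Assumed translation of "name = x" for a Python-object local. */
static void assign(PyObject **name, PyObject *x)
{
    incref(x);
    xdecref(*name);
    *name = x;
}

/* Runs the benchmark body n times and returns the border-call count. */
static long bench_border_calls(int n, int fast)
{
    PyObject x = { 1 };             /* assumed: one C reference on entry */
    PyObject *v[7] = { NULL };      /* the locals a, b, c, d, e, f, g */
    use_fast_path = fast;
    border_calls = 0;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < 7; k++)
            assign(&v[k], &x);
    for (int k = 0; k < 7; k++)     /* function exit: drop the locals */
        xdecref(v[k]);
    return border_calls;
}
```

With n = 10000 this mock makes 140000 border calls when the fast path is disabled and none when it is enabled, because the refcount of x never drops to 1 while the locals still point at it.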

When always calling into Py_IncRef() and Py_DecRef(), I get this:

$ pypy -m timeit -s 'from refcountbench import bench' 'bench(10)'
1000 loops, best of 3: 683 usec per loop

With the macros above, I get this:

$ pypy -m timeit -s 'from refcountbench import bench' 'bench(10)'
1000 loops, best of 3: 385 usec per loop

So that's better by a factor of about 1.8 (683 vs. 385 usec), just because the C compiler can
handle most of the ref-counting internally once there is more than one C
reference to an object. It will obviously be a lot less than that for
real-world code, but I think it makes it clear enough that it's worth
putting some effort into ways to avoid calling back and forth across the
border for no good reason.

Stefan