
Stefan Behnel, 26.02.2012 09:50:
when I took a look at object.h and saw that the Py_DECREF() macro *always* calls into it. Another surprise.
I had understood in previous discussions that the refcount emulation in cpyext only counts C references, which I consider a suitable design. (I guess something as common as Py_None uses the obvious optimisation of always having a ref-count > 1, right? At least when not debugging...)
So I changed the macros to use an appropriate C-level implementation:
"""
#define Py_INCREF(ob)  ((((PyObject *)ob)->ob_refcnt > 0) ? \
    ((PyObject *)ob)->ob_refcnt++ : (Py_IncRef((PyObject *)ob)))

#define Py_DECREF(ob)  ((((PyObject *)ob)->ob_refcnt > 1) ? \
    ((PyObject *)ob)->ob_refcnt-- : (Py_DecRef((PyObject *)ob)))

#define Py_XINCREF(op) do { if ((op) == NULL) ; else Py_INCREF(op); \
    } while (0)

#define Py_XDECREF(op) do { if ((op) == NULL) ; else Py_DECREF(op); \
    } while (0)
"""
to tell the C compiler that it doesn't actually need to call into PyPy in most cases (note that I didn't use any branch prediction macros, but that shouldn't change all that much anyway). This shaved off a couple of cycles from my iteration benchmark, but much less than I would have liked. My intuition tells me that this is because almost all objects that appear in the benchmark are actually short-lived in C space so that pretty much every Py_DECREF() on them kills them straight away and thus calls into Py_DecRef() anyway. To be verified with a better test.
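To make that intuition concrete, here is a minimal stand-alone model of the fast-path idea. It does not use cpyext at all: MockObject, Mock_IncRef(), Mock_DecRef(), MOCK_INCREF(), MOCK_DECREF() and slow_calls_short_lived() are all made-up names for illustration, and the assumption is that each object arrives in C space holding exactly one C reference. The counter only tracks how often the simulated "call into PyPy" would happen.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cpyext object; only the C-side count is modelled. */
typedef struct { ptrdiff_t ob_refcnt; } MockObject;

/* Number of simulated calls across the C/PyPy border. */
static long slow_calls = 0;

static void Mock_IncRef(MockObject *ob) { slow_calls++; ob->ob_refcnt++; }
static void Mock_DecRef(MockObject *ob) { slow_calls++; ob->ob_refcnt--; }

/* Same conditions as the macros above, written as statements. */
#define MOCK_INCREF(ob) \
    do { if ((ob)->ob_refcnt > 0) (ob)->ob_refcnt++; else Mock_IncRef(ob); } while (0)
#define MOCK_DECREF(ob) \
    do { if ((ob)->ob_refcnt > 1) (ob)->ob_refcnt--; else Mock_DecRef(ob); } while (0)

/*
 * Each iteration models one short-lived object: it starts with one C
 * reference, optionally gains and loses `extra_refs` more, and is then
 * released for good. Returns the number of simulated border calls.
 */
long slow_calls_short_lived(int iterations, int extra_refs)
{
    slow_calls = 0;
    for (int i = 0; i < iterations; i++) {
        MockObject ob = {1};
        for (int k = 0; k < extra_refs; k++)
            MOCK_INCREF(&ob);          /* count > 0: stays in C */
        for (int k = 0; k < extra_refs; k++)
            MOCK_DECREF(&ob);          /* count > 1: stays in C */
        MOCK_DECREF(&ob);              /* count == 1: has to call out */
    }
    return slow_calls;
}
```

In this model, slow_calls_short_lived(1000, 0) and slow_calls_short_lived(1000, 3) both report 1000 border calls: the extra references are handled in C, but the final release of every object still crosses the border, which is what I suspect is happening in the iteration benchmark.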
Ok, here's a stupid micro-benchmark for ref-counting:

    def bench(x):
        cdef int i
        for i in xrange(10000):
            a = x
            b = x
            c = x
            d = x
            e = x
            f = x
            g = x

Leads to the obvious C code. :) (and yes, this will eventually stop actually being a benchmark in Cython...)

When always calling into Py_IncRef() and Py_DecRef(), I get this:

    $ pypy -m timeit -s 'from refcountbench import bench' 'bench(10)'
    1000 loops, best of 3: 683 usec per loop

With the macros above, I get this:

    $ pypy -m timeit -s 'from refcountbench import bench' 'bench(10)'
    1000 loops, best of 3: 385 usec per loop

So that's better by almost a factor of 2, just because the C compiler can handle most of the ref-counting internally once there is more than one C reference to an object. It will obviously be a lot less than that for real-world code, but I think it makes it clear enough that it's worth putting some effort into ways to avoid calling back and forth across the border for no good reason.

Stefan
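For reference, a rough model of what the micro-benchmark does to the refcount of x. This is a sketch, not the actual Cython output: it assumes that each assignment increfs the new value and then decrefs the old one, that the caller holds one C reference to x, and that the seven locals are released on function exit. MockObject, Mock_IncRef(), Mock_DecRef() and bench_border_calls() are hypothetical names; the functions only count how often a call across the border would be needed.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cpyext object; only the C-side count is modelled. */
typedef struct { ptrdiff_t ob_refcnt; } MockObject;

static long border_calls = 0;   /* simulated calls into PyPy */
static int always_call = 0;     /* 1: old behaviour, 0: fast-path macros */

static void Mock_IncRef(MockObject *ob) { border_calls++; ob->ob_refcnt++; }
static void Mock_DecRef(MockObject *ob) { border_calls++; ob->ob_refcnt--; }

static void incref(MockObject *ob)
{
    if (!always_call && ob->ob_refcnt > 0)
        ob->ob_refcnt++;            /* handled in C */
    else
        Mock_IncRef(ob);            /* crosses the border */
}

static void xdecref(MockObject *ob)
{
    if (ob == NULL)
        return;
    if (!always_call && ob->ob_refcnt > 1)
        ob->ob_refcnt--;            /* handled in C */
    else
        Mock_DecRef(ob);            /* crosses the border */
}

/* Models bench(x): seven locals (a..g) are rebound to x in every iteration. */
long bench_border_calls(int iterations, int call_always)
{
    MockObject x = {1};             /* the caller's reference to the argument */
    MockObject *vars[7] = {NULL};   /* a, b, c, d, e, f, g */

    border_calls = 0;
    always_call = call_always;
    for (int i = 0; i < iterations; i++) {
        for (int v = 0; v < 7; v++) {
            MockObject *old = vars[v];
            incref(&x);             /* new reference for the local */
            vars[v] = &x;
            xdecref(old);           /* drop whatever the local held before */
        }
    }
    for (int v = 0; v < 7; v++)
        xdecref(vars[v]);           /* locals are released on function exit */
    return border_calls;
}
```

Under these assumptions, bench_border_calls(10000, 1) counts 140000 border calls, while bench_border_calls(10000, 0) counts none, because the refcount of x never drops to 1 inside the function. The model says nothing about what each call costs, so it does not predict the timings above, only why the macros help so much in this particular case.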