
Hi,

having rewritten Cython's dict iteration to use normal iteration, I ran a little benchmark for comparison, and it turned out to be (algorithmically) faster by more than an order of magnitude for a dict with 1000 items. That's pretty decent, although it's still about 100x slower than in CPython. I'm certain that there's much more of this low-hanging fruit to find all over the place.

Next, I ran callgrind over my little benchmark, and it showed that the dominating part of the running time had shifted from 99% in PyDict_Next() to 56% in Py_DecRef(). Sadly, there wasn't much more to be read out of it, because the graph was just full of hex numbers instead of names. That's to be attributed to the JIT-compiled code, I guess.

Looking at Py_DecRef(), however, left me somewhat baffled. I would have expected this to be the most intensively tuned function in all of cpyext, but it even started with this comment:

"""
# XXX Optimize these functions and put them into macro definitions
"""

Ok, sounds reasonable. It should have been done already, though, especially since I then took a look at object.h and saw that the Py_DECREF() macro *always* calls into it. Another surprise. I had understood from previous discussions that the refcount emulation in cpyext only counts C references, which I consider a suitable design. (I guess something as common as Py_None uses the obvious optimisation of always having a refcount > 1, right? At least when not debugging...)

So I changed the macros to an appropriate C-level implementation that tells the C compiler it doesn't actually need to call into PyPy in most cases:

"""
#define Py_INCREF(ob)  ((((PyObject *)ob)->ob_refcnt > 0) ? \
                           ((PyObject *)ob)->ob_refcnt++ :   \
                           (Py_IncRef((PyObject *)ob)))
#define Py_DECREF(ob)  ((((PyObject *)ob)->ob_refcnt > 1) ? \
                           ((PyObject *)ob)->ob_refcnt-- :   \
                           (Py_DecRef((PyObject *)ob)))
#define Py_XINCREF(op) do { if ((op) == NULL) ; else Py_INCREF(op); } while (0)
#define Py_XDECREF(op) do { if ((op) == NULL) ; else Py_DECREF(op); } while (0)
"""

(Note that I didn't use any branch prediction macros, but that shouldn't change all that much anyway.)

This shaved off a couple of cycles in my iteration benchmark, but much less than I would have liked. My intuition tells me that this is because almost all objects that appear in the benchmark are actually short-lived in C space, so that pretty much every Py_DECREF() on them kills them straight away and thus calls into Py_DecRef() anyway. To be verified with a better test.

I think this could be fixed by figuring out whether there is at least one live reference to the object inside of PyPy (which is usually the case for container iteration and parameters, for example) and, as long as that's the case, adding one to the refcount to keep C code from having to call back into PyPy unnecessarily. A finaliser could then drop that extra reference, I guess.

Ok, next, I looked at the Py_DecRef() implementation itself and found it to be rather long. Since I wasn't sure what the JIT would make of it, please don't shout at me if any of the following comments ("^^") make no sense.

    @cpython_api([PyObject], lltype.Void)
    def Py_DecRef(space, obj):
        if not obj:
            return

^^ this would best be handled right in C space, to avoid unnecessary call overhead and to allow the C compiler to drop this test when possible.
        assert lltype.typeOf(obj) == PyObject
        obj.c_ob_refcnt -= 1
        if DEBUG_REFCOUNT:
            debug_refcount("DECREF", obj, obj.c_ob_refcnt,
                           frame_stackdepth=3)

^^ ok, I assume debug code will automatically be removed completely

        if obj.c_ob_refcnt == 0:
            state = space.fromcache(RefcountState)
            ptr = rffi.cast(ADDR, obj)
            if ptr not in state.py_objects_r2w:
                # this is a half-allocated object, lets call the deallocator
                # without modifying the r2w/w2r dicts
                _Py_Dealloc(space, obj)

^^ this looks like an emergency special case to me - it shouldn't be on the fast path...

            else:
                w_obj = state.py_objects_r2w[ptr]

^^ here, I wonder if the JIT is smart enough to detect the double weakref dict lookup in the (I guess) 99% code path and fold it into one?

^^ I actually wonder why this needs to use a weakref dict instead of a normal dict. It's ref-counted after all, so we *know* there's a live reference to it until further notice. Are weakref dicts as fast as normal dicts in PyPy?

                del state.py_objects_r2w[ptr]
                w_type = space.type(w_obj)
                if not w_type.is_cpytype():
                    _Py_Dealloc(space, obj)
                del state.py_objects_w2r[w_obj]
                # if the object was a container for borrowed references
                state.delete_borrower(w_obj)

^^ somewhere in here, I would have expected the use of a freelist for the C-space object representations in order to avoid unnecessary allocations, especially for things like short tuples etc. Is this hidden somewhere else? Or does PyPy's runtime handle this kind of caching internally somewhere?

        else:
            if not we_are_translated() and obj.c_ob_refcnt < 0:
                message = "Negative refcount for obj %s with type %s" % (
                    obj, rffi.charp2str(obj.c_ob_type.c_tp_name))
                print >>sys.stderr, message
                assert False, message

^^ This will also vanish automatically, right?

So, from a quick and shallow look, it seems to me that this code could be improved. By taking work off the fast paths and avoiding create-delete cycles, something as ubiquitous as this function could help make C-API code run a lot faster.

Stefan
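
P.S. For completeness, here is a rough sketch of what the branch-hinted variant of the above macros could look like, using GCC's __builtin_expect. To keep the snippet self-contained and compilable, the PyObject struct, the slow_path_calls counter and the Py_IncRef()/Py_DecRef() stubs below are stand-ins of my own invention, not the real cpyext entry points; only the shape of the macros is the point.

```c
#include <assert.h>

/* Stand-in for the real PyObject: only the refcount field matters here. */
typedef struct { long ob_refcnt; } PyObject;

/* Stand-ins for cpyext's slow-path entry points, instrumented so we can
   observe how often the macros actually call out of C space. */
static long slow_path_calls = 0;

static void Py_IncRef(PyObject *ob)
{
    slow_path_calls++;
    if (ob)
        ob->ob_refcnt++;
}

static void Py_DecRef(PyObject *ob)
{
    slow_path_calls++;
    if (ob && --ob->ob_refcnt == 0) {
        /* here the real implementation would call back into PyPy
           to tear the object down */
    }
}

/* Branch prediction hints; fall back to plain conditions elsewhere. */
#if defined(__GNUC__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif

/* Fast path: adjust the refcount in place; only call into PyPy when the
   count is about to reach zero (DECREF) or looks uninitialised (INCREF). */
#define Py_INCREF(ob) (likely(((PyObject *)(ob))->ob_refcnt > 0) ? \
        (void)((PyObject *)(ob))->ob_refcnt++ : Py_IncRef((PyObject *)(ob)))
#define Py_DECREF(ob) (likely(((PyObject *)(ob))->ob_refcnt > 1) ? \
        (void)((PyObject *)(ob))->ob_refcnt-- : Py_DecRef((PyObject *)(ob)))
```

With the hints in place, the compiler can lay out the in-place increment/decrement as the straight-line code and push the call into PyPy off to a cold branch, which is exactly the layout we want for the short-lived-object case described above.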