
Hi,

having rewritten Cython's dict iteration to use normal iteration, I ran a little benchmark for comparison, and it turned out to be (algorithmically) faster by more than an order of magnitude for a dict with 1000 items. That's pretty decent, although it's still about 100x slower than in CPython. I'm certain that there's much more of this low-hanging fruit to find all over the place.

Next, I ran callgrind over my little benchmark, and it showed that the dominating part of the running time had shifted from 99% in PyDict_Next() to 56% in Py_DecRef(). Sadly, there wasn't much more to be read out of it, because the graph was just full of hex numbers instead of names. That's to be attributed to the JIT-compiled code, I guess.

Looking at Py_DecRef(), however, left me somewhat baffled. I would have expected this to be the most intensively tuned function in all of cpyext, but it even started with this comment:

"""
# XXX Optimize these functions and put them into macro definitions
"""

Ok, sounds reasonable. It should have been done already, though, especially since I then took a look at object.h and saw that the Py_DECREF() macro *always* calls into it. Another surprise. I had understood from previous discussions that the refcount emulation in cpyext only counts C references, which I consider a suitable design. (I guess something as common as Py_None uses the obvious optimisation of always having a refcount > 1, right? At least when not debugging...)

So I changed the macros to an appropriate C-level implementation that tells the C compiler it doesn't actually need to call into PyPy in most cases:

"""
#define Py_INCREF(ob)  ((((PyObject *)ob)->ob_refcnt > 0) ? \
                           ((PyObject *)ob)->ob_refcnt++ :   \
                           (Py_IncRef((PyObject *)ob)))
#define Py_DECREF(ob)  ((((PyObject *)ob)->ob_refcnt > 1) ? \
                           ((PyObject *)ob)->ob_refcnt-- :   \
                           (Py_DecRef((PyObject *)ob)))
#define Py_XINCREF(op) do { if ((op) == NULL) ; else Py_INCREF(op); } while (0)
#define Py_XDECREF(op) do { if ((op) == NULL) ; else Py_DECREF(op); } while (0)
"""

(Note that I didn't use any branch prediction macros, but that shouldn't change all that much anyway.)

This shaved off a couple of cycles in my iteration benchmark, but much less than I would have liked. My intuition tells me that this is because almost all objects that appear in the benchmark are actually short-lived in C space, so that pretty much every Py_DECREF() on them kills them straight away and thus calls into Py_DecRef() anyway. To be verified with a better test.

I think this could be fixed by figuring out whether there is at least one live reference to the object inside of PyPy (which is usually the case for container iteration and parameters, for example) and, as long as that's the case, adding one to the refcount to keep C code from having to call back into PyPy unnecessarily. A finaliser could then drop that extra reference, I guess.

Ok, next, I looked at the Py_DecRef() implementation itself and found it to be rather long. Since I wasn't sure what the JIT would make of it, please don't shout at me if any of the following comments ("^^") make no sense.

    @cpython_api([PyObject], lltype.Void)
    def Py_DecRef(space, obj):
        if not obj:
            return

^^ this would best be handled right in C space, to avoid unnecessary call overhead and to allow the C compiler to drop this test when possible.
        assert lltype.typeOf(obj) == PyObject
        obj.c_ob_refcnt -= 1
        if DEBUG_REFCOUNT:
            debug_refcount("DECREF", obj, obj.c_ob_refcnt,
                           frame_stackdepth=3)

^^ ok, I assume debug code will automatically be removed completely

        if obj.c_ob_refcnt == 0:
            state = space.fromcache(RefcountState)
            ptr = rffi.cast(ADDR, obj)
            if ptr not in state.py_objects_r2w:
                # this is a half-allocated object, lets call the deallocator
                # without modifying the r2w/w2r dicts
                _Py_Dealloc(space, obj)

^^ this looks like an emergency special case to me - it shouldn't be on the fast path...

            else:
                w_obj = state.py_objects_r2w[ptr]

^^ here, I wonder if the JIT is smart enough to detect the double weakref dict lookup in the (I guess) 99% code path and fold it into one?

^^ I actually wonder why this needs to use a weakref dict instead of a normal dict. It's ref-counted after all, so we *know* there's a live reference to it until further notice. Are weakref dicts as fast as normal dicts in PyPy?

                del state.py_objects_r2w[ptr]
                w_type = space.type(w_obj)
                if not w_type.is_cpytype():
                    _Py_Dealloc(space, obj)
                del state.py_objects_w2r[w_obj]
                # if the object was a container for borrowed references
                state.delete_borrower(w_obj)

^^ somewhere in here, I would have expected the use of a freelist for the C-space object representations in order to avoid unnecessary allocations, especially for things like short tuples etc. Is this hidden somewhere else? Or does PyPy's runtime handle this kind of caching internally somewhere?

        else:
            if not we_are_translated() and obj.c_ob_refcnt < 0:
                message = "Negative refcount for obj %s with type %s" % (
                    obj, rffi.charp2str(obj.c_ob_type.c_tp_name))
                print >>sys.stderr, message
                assert False, message

^^ This will also vanish automatically, right?

So, from a quick and shallow look, it seems to me that this code could be improved. By taking work off the fast paths and avoiding create-delete cycles, something as ubiquitous as this function could help make C-API code run a lot faster.

Stefan
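
P.S. For completeness, here is a rough sketch of what the branch-hinted variant of the above macros could look like, using GCC's __builtin_expect. To keep the snippet self-contained and compilable, the PyObject struct, the slow_path_calls counter and the Py_IncRef()/Py_DecRef() stubs below are stand-ins of my own invention, not the real cpyext entry points; only the shape of the macros is the point.

```c
#include <assert.h>

/* Stand-in for the real PyObject: only the refcount field matters here. */
typedef struct { long ob_refcnt; } PyObject;

/* Stand-ins for cpyext's slow-path entry points, instrumented so we can
   observe how often the macros actually call out of C space. */
static long slow_path_calls = 0;

static void Py_IncRef(PyObject *ob)
{
    slow_path_calls++;
    if (ob)
        ob->ob_refcnt++;
}

static void Py_DecRef(PyObject *ob)
{
    slow_path_calls++;
    if (ob && --ob->ob_refcnt == 0) {
        /* here the real implementation would call back into PyPy
           to tear the object down */
    }
}

/* Branch prediction hints; fall back to plain conditions elsewhere. */
#if defined(__GNUC__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif

/* Fast path: adjust the refcount in place; only call into PyPy when the
   count is about to reach zero (DECREF) or looks uninitialised (INCREF). */
#define Py_INCREF(ob) (likely(((PyObject *)(ob))->ob_refcnt > 0) ? \
        (void)((PyObject *)(ob))->ob_refcnt++ : Py_IncRef((PyObject *)(ob)))
#define Py_DECREF(ob) (likely(((PyObject *)(ob))->ob_refcnt > 1) ? \
        (void)((PyObject *)(ob))->ob_refcnt-- : Py_DecRef((PyObject *)(ob)))
```

With the hints in place, the compiler can lay out the in-place increment/decrement as the straight-line code and push the call into PyPy off to a cold branch, which is exactly the layout we want for the short-lived-object case described above.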