
Hi,

having rewritten Cython's dict iteration to use normal iteration, I ran a little benchmark for comparison and it turned out to be (algorithmically) faster by more than an order of magnitude for a dict with 1000 items. That's pretty decent, although it's still about 100x slower than in CPython. I'm quite sure there's much more of this low-hanging fruit to find all over the place.

Next, I ran callgrind over my little benchmark and it showed that the dominating part of the running time had shifted from 99% in PyDict_Next() to 56% in Py_DecRef(). Sadly, there wasn't much more to be read out of it, because the graph was just full of hex numbers instead of names. To be attributed to the JIT-compiled code, I guess.

Looking at Py_DecRef(), however, left me somewhat baffled. I would have expected this to be the most intensively tuned function in all of cpyext, but it even started with this comment:

"""
# XXX Optimize these functions and put them into macro definitions
"""

Ok, sounds reasonable. It should have been done already, though, especially since a look at object.h showed me that the Py_DECREF() macro *always* calls into it. Another surprise. I had understood in previous discussions that the refcount emulation in cpyext only counts C references, which I consider a suitable design. (I guess something as common as Py_None uses the obvious optimisation of always having a ref-count > 1, right? At least when not debugging...)

So I changed the macros to use an appropriate C-level implementation:

"""
#define Py_INCREF(ob)   ((((PyObject *)ob)->ob_refcnt > 0) ? \
            ((PyObject *)ob)->ob_refcnt++ : (Py_IncRef((PyObject *)ob)))
#define Py_DECREF(ob)   ((((PyObject *)ob)->ob_refcnt > 1) ? \
            ((PyObject *)ob)->ob_refcnt-- : (Py_DecRef((PyObject *)ob)))
#define Py_XINCREF(op)  do { if ((op) == NULL) ; else Py_INCREF(op); } while (0)
#define Py_XDECREF(op)  do { if ((op) == NULL) ; else Py_DECREF(op); } while (0)
"""

to tell the C compiler that it doesn't actually need to call into PyPy in most cases (note that I didn't use any branch prediction macros, but that shouldn't change all that much anyway). This shaved off a couple of cycles from my iteration benchmark, but much less than I would have liked. My intuition tells me that this is because almost all objects that appear in the benchmark are actually short-lived in C space, so that pretty much every Py_DECREF() on them kills them straight away and thus calls into Py_DecRef() anyway. To be verified with a better test.

I think this could be fixed by figuring out if there is at least one live reference to the object inside of PyPy (which is usually the case for container iteration and parameters, for example) and, as long as that's the case, adding one to the refcount to keep C code from needing to call back into PyPy unnecessarily. A finaliser could do that, I guess.

Ok, next, I looked at the Py_DecRef() implementation itself and found it to be rather long. Since I wasn't sure what the JIT would make of it, please don't shout at me if any of the following comments ("^^") make no sense.

@cpython_api([PyObject], lltype.Void)
def Py_DecRef(space, obj):
    if not obj:
        return

^^ this would best be handled right in C space to avoid unnecessary call overhead and to allow the C compiler to drop this test when possible.

    assert lltype.typeOf(obj) == PyObject
    obj.c_ob_refcnt -= 1
    if DEBUG_REFCOUNT:
        debug_refcount("DECREF", obj, obj.c_ob_refcnt, frame_stackdepth=3)

^^ ok, I assume debug code will automatically be removed completely

    if obj.c_ob_refcnt == 0:
        state = space.fromcache(RefcountState)
        ptr = rffi.cast(ADDR, obj)
        if ptr not in state.py_objects_r2w:
            # this is a half-allocated object, lets call the deallocator
            # without modifying the r2w/w2r dicts
            _Py_Dealloc(space, obj)

^^ this looks like an emergency special case to me - shouldn't be on the fast path...

        else:
            w_obj = state.py_objects_r2w[ptr]

^^ here, I wonder if the JIT is smart enough to detect the double weakref dict lookup in the (I guess) 99% code path and folds it into one?

^^ I actually wonder why this needs to use a weakref dict instead of a normal dict. It's ref-counted after all, so we *know* there's a live reference to it until further notice. Are weakref dicts as fast as normal dicts in PyPy?

            del state.py_objects_r2w[ptr]
            w_type = space.type(w_obj)
            if not w_type.is_cpytype():
                _Py_Dealloc(space, obj)
            del state.py_objects_w2r[w_obj]
            # if the object was a container for borrowed references
            state.delete_borrower(w_obj)

^^ in some place here, I would have expected the use of a freelist for the C space object representations in order to avoid unnecessary allocations, especially for things like short tuples etc. Is this hidden somewhere else? Or does PyPy's runtime handle this kind of caching internally somewhere?

    else:
        if not we_are_translated() and obj.c_ob_refcnt < 0:
            message = "Negative refcount for obj %s with type %s" % (
                obj, rffi.charp2str(obj.c_ob_type.c_tp_name))
            print >>sys.stderr, message
            assert False, message

^^ This will also vanish automatically, right?

So, from a quick and shallow look, it seems to me that this code could be improved. By taking work off the fast paths and avoiding create-delete cycles, something as ubiquitous as this function could help in making C-API code run a lot faster.

Stefan

Stefan Behnel, 26.02.2012 09:50:
when I took a look at object.h and saw that the Py_DECREF() macro *always* calls into it. Another surprise.
I had understood in previous discussions that the refcount emulation in cpyext only counts C references, which I consider a suitable design. (I guess something as common as Py_None uses the obvious optimisation of always having a ref-count > 1, right? At least when not debugging...)
So I changed the macros to use an appropriate C-level implementation:
"""
#define Py_INCREF(ob)   ((((PyObject *)ob)->ob_refcnt > 0) ? \
            ((PyObject *)ob)->ob_refcnt++ : (Py_IncRef((PyObject *)ob)))
#define Py_DECREF(ob)   ((((PyObject *)ob)->ob_refcnt > 1) ? \
            ((PyObject *)ob)->ob_refcnt-- : (Py_DecRef((PyObject *)ob)))
#define Py_XINCREF(op)  do { if ((op) == NULL) ; else Py_INCREF(op); } while (0)
#define Py_XDECREF(op)  do { if ((op) == NULL) ; else Py_DECREF(op); } while (0)
"""
to tell the C compiler that it doesn't actually need to call into PyPy in most cases (note that I didn't use any branch prediction macros, but that shouldn't change all that much anyway). This shaved off a couple of cycles from my iteration benchmark, but much less than I would have liked. My intuition tells me that this is because almost all objects that appear in the benchmark are actually short-lived in C space so that pretty much every Py_DECREF() on them kills them straight away and thus calls into Py_DecRef() anyway. To be verified with a better test.
Ok, here's a stupid micro-benchmark for ref-counting:

def bench(x):
    cdef int i
    for i in xrange(10000):
        a = x
        b = x
        c = x
        d = x
        e = x
        f = x
        g = x

Leads to the obvious C code. :) (and yes, this will eventually stop actually being a benchmark in Cython...)

When always calling into Py_IncRef() and Py_DecRef(), I get this:

$ pypy -m timeit -s 'from refcountbench import bench' 'bench(10)'
1000 loops, best of 3: 683 usec per loop

With the macros above, I get this:

$ pypy -m timeit -s 'from refcountbench import bench' 'bench(10)'
1000 loops, best of 3: 385 usec per loop

So that's better by almost a factor of 2, just because the C compiler can handle most of the ref-counting internally once there is more than one C reference to an object. It will obviously be a lot less than that for real-world code, but I think it makes it clear enough that it's worth putting some effort into ways to avoid calling back and forth across the border for no good reason.

Stefan

Hi Stefan,

On Sun, Feb 26, 2012 at 09:50, Stefan Behnel <stefan_ml@behnel.de> wrote:
Looking at Py_DecRef(), however, left me somewhat baffled. I would have expected this to be the most intensively tuned function in all of cpyext, but it even started with this comment: (...)
Indeed, it's an obvious starting place if we want to optimize cpyext (which did not occur at all so far). You are welcome to try.

Note that the JIT has nothing to do here: we cannot JIT any code written in C, and it makes no sense to apply the JIT on a short RPython callback alone. But because most of the code in module/cpyext/ is RPython code, it means it gets turned into equivalent C code statically, with (as you noticed) the debugging checks constant-folded away.

The first thing to try would be to rethink how the PyPy object and the PyObject are linked together. Right now it's done with two (possibly weak) dictionaries, one for each direction. We can at least improve the situation by having a normal field in the PyObject pointing back to the PyPy object. This needs to be done carefully but can be done.

The issue is that the GC needs to know about this field. It would probably require something like: allocate some GcArrays of PyObject structures (not pointers, directly PyObjects --- which all have the same size here, so it works). Use something like 100 PyObject structures per GcArray, and collect all the GcArrays in a global list. Use a freelist for dead entries. If you allocate each GcArray as "non-movable", then you can take pointers to the PyObjects and pass them to C code. As they are inside the regular GcArrays, they are GC-tracked and can contain a field that points back to the PyPy object.

A bientôt,

Armin.

Hi Armin,

Armin Rigo, 26.02.2012 11:09:
On Sun, Feb 26, 2012 at 09:50, Stefan Behnel wrote:
Looking at Py_DecRef(), however, left me somewhat baffled. I would have expected this to be the most intensively tuned function in all of cpyext, but it even started with this comment: (...)
Indeed, it's an obvious starting place if we want to optimize cpyext (which did not occur at all so far). You are welcome to try.
Looks like it's worth it.
Note that the JIT has nothing to do here: we cannot JIT any code written in C
... obviously - and there's C compilers for that anyway.
and it makes no sense to apply the JIT on a short RPython callback alone.
I can't see why that would be so. Just looking at Py_DecRef(), I can see lots of places where cross-function runtime optimisations would make sense, for example. I'm sure a C compiler will have a hard time finding the right fast path in there.
But because most of the code in module/cpyext/ is RPython code, it means it gets turned into equivalent C code statically
Interesting. Given PyPy's reputation of taking tons of resources to build, I assume you apply WPA to the sources in order to map them to C? Then why wouldn't I get better traces from gdb and valgrind for the generated code? Is it just that the nightly builds lack debugging symbols?
The first thing to try would be to rethink how the PyPy object and the PyObject are linked together. Right now it's done with two (possibly weak) dictionaries, one for each direction. We can at least improve the situation by having a normal field in the PyObject pointing back to the PyPy object. This needs to be done carefully but can be done.
Based on my experience with lxml, such a change is absolutely worth any hassle.
The issue is that the GC needs to know about this field. It would probably require something like: allocate some GcArrays of PyObject structures (not pointers, directly PyObjects --- which have all the same size here, so it works). Use something like 100 PyObject structures per GcArray, and collect all the GcArrays in a global list. Use a freelist for dead entries. If you allocate each GcArray as "non-movable", then you can take pointers to the PyObjects and pass them to C code. As they are inside the regular GcArrays, they are GC-tracked and can contain a field that points back to the PyPy object.
Sounds like a good idea to me.

Stefan

Hi,

On Sun, Feb 26, 2012 at 12:31, Stefan Behnel <stefan_ml@behnel.de> wrote:
Interesting. Given PyPy's reputation of taking tons of resources to build, I assume you apply WPA to the sources in order to map them to C?
Yes. Please read more about it starting for example from here: http://doc.pypy.org/en/latest/architecture.html
Then why wouldn't I get better traces from gdb and valgrind for the generated code? Is it just that the nightly builds lack debugging symbols?
Yes. With a proper debugging build you get at least the intermediate C code. Not extremely readable, but at least you can follow it.

A bientôt,

Armin.

Armin Rigo, 26.02.2012 11:09:
On Sun, Feb 26, 2012 at 09:50, Stefan Behnel wrote:
Looking at Py_DecRef(), however, left me somewhat baffled. I would have expected this to be the most intensively tuned function in all of cpyext, but it even started with this comment: (...)
Indeed, it's an obvious starting place if we want to optimize cpyext (which did not occur at all so far). You are welcome to try.
Wouldn't optimising and improving cpyext be a perfect topic for a GSoC student?

Stefan
participants (2)

- Armin Rigo
- Stefan Behnel