
Hi,

I've been working with one of the modules in the Python benchmark suite, namely nbody, and tried to make it run a little faster when compiled with Cython in PyPy. I managed to get a massive speed-up by avoiding some borrowed references during list iteration and using PySequence_GetItem() instead, but now some 50-60% of the runtime is spent in Py_DecRef(). I tried debugging into it, but so far, all my attempts to get anything useful out of PyPy or cpyext have failed. I applied this obvious change

"""
diff -r 20258fbf10d0 pypy/module/cpyext/pyobject.py
--- a/pypy/module/cpyext/pyobject.py    Sun Jul 01 10:02:26 2012 +0200
+++ b/pypy/module/cpyext/pyobject.py    Tue Jul 03 11:18:22 2012 +0200
@@ -340,13 +340,13 @@
     if obj.c_ob_refcnt == 0:
         state = space.fromcache(RefcountState)
         ptr = rffi.cast(ADDR, obj)
-        if ptr not in state.py_objects_r2w:
+        try:
+            w_obj = state.py_objects_r2w.pop(ptr)
+        except KeyError:
             # this is a half-allocated object, lets call the deallocator
             # without modifying the r2w/w2r dicts
             _Py_Dealloc(space, obj)
         else:
-            w_obj = state.py_objects_r2w[ptr]
-            del state.py_objects_r2w[ptr]
             w_type = space.type(w_obj)
             if not w_type.is_cpytype():
                 _Py_Dealloc(space, obj)
"""

and it gave me a couple of percent, but beyond that, I'd have to know what paths are actually taken here and where most of the time is spent. Running it through callgrind yields nothing helpful, so I tried running it in plain Python. That failed with an OverflowError in ll2ctypes.py:

"""
Traceback (most recent call last):
  File "pypy-1.9/pypy/bin/py.py", line 187, in <module>
    sys.exit(main_(sys.argv))
  File "pypy-1.9/pypy/bin/py.py", line 86, in main_
    space = option.make_objspace(config)
  File "pypy-1.9/pypy/tool/option.py", line 45, in make_objspace
    None, None, ['Space'])
  File "pypy-1.9/pypy/objspace/std/__init__.py", line 1, in <module>
    from pypy.objspace.std.objspace import StdObjSpace
  File "pypy-1.9/pypy/objspace/std/objspace.py", line 18, in <module>
    from pypy.objspace.std.complexobject import W_ComplexObject
  File "pypy-1.9/pypy/objspace/std/complexobject.py", line 6, in <module>
    from pypy.objspace.std.floatobject import W_FloatObject, _hash_float
  File "pypy-1.9/pypy/objspace/std/floatobject.py", line 48, in <module>
    class W_FloatObject(W_AbstractFloatObject):
  File "pypy-1.9/pypy/objspace/std/floatobject.py", line 51, in W_FloatObject
    from pypy.objspace.std.floattype import float_typedef as typedef
  File "pypy-1.9/pypy/objspace/std/floattype.py", line 87, in <module>
    _double_format, _float_format = detect_floatformat()
  File "pypy-1.9/pypy/objspace/std/floattype.py", line 64, in detect_floatformat
    rffi.cast(rffi.DOUBLEP, buf)[0] = 9006104071832581.0
  File "pypy-1.9/pypy/rpython/lltypesystem/lltype.py", line 1219, in __setitem__
    self._obj.setitem(i, val)
  File "pypy-1.9/pypy/rpython/lltypesystem/ll2ctypes.py", line 620, in setitem
    self._storage.contents._setitem(index, value, boundscheck=False)
  File "pypy-1.9/pypy/rpython/lltypesystem/ll2ctypes.py", line 279, in _setitem
    items = self._indexable(index)
  File "pypy-1.9/pypy/rpython/lltypesystem/ll2ctypes.py", line 259, in _indexable
    PtrType = self._get_ptrtype()
  File "pypy-1.9/pypy/rpython/lltypesystem/ll2ctypes.py", line 255, in _get_ptrtype
    raise e
OverflowError: array too large
"""

No idea what this is supposed to tell me, except that I apparently hit yet another bug in PyPy. Now, having wasted enough time with this, I'd be happy if someone else could take over.
I put the C code that I used here: http://cython.org/nbodybench.tar.bz2

You can build it without Cython; just run the included setupdu.py script over it and call

pypy -c 'import nbody; print(nbody.test_nbody(1))'

Then try to profile that.

Stefan

Hi, 2012/7/3 Stefan Behnel <stefan_ml@behnel.de>
Don't forget that cpyext reference counts are quite different from CPython's: PySequence_GetItem() needs to *create* a PyObject structure, and the returned object has a refcount of 1. Then Py_DECREF() will really *deallocate* the PyObject structure... This is considerably more expensive than the simple refcount increment/decrement done by CPython.
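To make that concrete, here is a toy model in plain Python (the names are made up; this is not CPython or cpyext code): iterating over N items costs N cheap counter bumps in CPython, but N allocate/deallocate pairs under cpyext.

"""
class FakePyObject(object):
    def __init__(self, w_obj):
        self.w_obj = w_obj      # the interpreter-level object this proxy stands for
        self.refcnt = 1

def cpython_getitem(pyobj):
    pyobj.refcnt += 1           # the PyObject already exists: one integer increment
    return pyobj

def cpython_decref(pyobj):
    pyobj.refcnt -= 1           # and one integer decrement when the caller is done

def cpyext_getitem(w_obj):
    return FakePyObject(w_obj)  # a fresh PyObject proxy is allocated on every access

def cpyext_decref(pyobj):
    pyobj.refcnt -= 1
    if pyobj.refcnt == 0:       # ...and torn down again as soon as the caller drops it
        pyobj.w_obj = None
"""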
OverflowError: array too large
Looks like a ctypes bug to me. Which OS, Python, etc. are you using? -- Amaury Forgeot d'Arc

Amaury Forgeot d'Arc, 03.07.2012 14:55:
Sure. And using PySequence_GetItem() is still some 10x faster in PyPy than taking a borrowed reference using PyList_GET_ITEM() and then calling Py_INCREF() on it, which is what Cython does in CPython because it's way faster there. The overhead of borrowed references is seriously huge in PyPy. BTW, are PyObject structures currently cached in a free-list somewhere? That would be really helpful for the iteration performance.
Ah - totally, sure. I accidentally ran the system Py2.5 on 64bit Linux. Running it with Py2.7 fixes this specific problem, thanks for the hint! Although it now names the extension module "nbody.so" instead of "nbody.pypy-19.so". Comprend qui peut ... After figuring out that I was supposed to enable cpyext manually and running strace to see what extension module name it is actually looking for, I failed to make it load the module it just built regardless of how I named it, so I tried building it within the same run as follows: pypy/bin/py.py --withmod-cpyext -c 'import setup; import nbody; \ nbody.test_nbody(1)' build_ext -i Doing this, it then fails with the following error: """Traceback (most recent call last): File ".../pypy/bin/py.py", line 187, in <module> sys.exit(main_(sys.argv)) File ".../pypy/bin/py.py", line 158, in main_ verbose=interactiveconfig.verbose): File ".../pypy/interpreter/main.py", line 103, in run_toplevel f() File ".../pypy/bin/py.py", line 133, in doit main.run_string(command, space=space) File ".../pypy/interpreter/main.py", line 59, in run_string _run_eval_string(source, filename, space, False) File ".../pypy/interpreter/main.py", line 48, in _run_eval_string retval = pycode.exec_code(space, w_globals, w_globals) File ".../pypy/interpreter/eval.py", line 33, in exec_code return frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 834, in IMPORT_NAME w_locals, w_fromlist) File ".../pypy/interpreter/baseobjspace.py", line 986, in call_function return w_func.funccall(*args_w) File ".../pypy/interpreter/function.py", line 111, in funccall return self.call_args(Arguments(self.space, list(args_w))) File ".../pypy/interpreter/function.py", line 59, in call_args return self.getcode().funcrun(self, args) File ".../pypy/interpreter/gateway.py", line 614, in funcrun return BuiltinCode.funcrun_obj(self, func, None, args) File ".../pypy/interpreter/gateway.py", line 631, in funcrun_obj self.handle_exception(space, e) File ".../pypy/interpreter/gateway.py", line 648, in handle_exception rstackovf.check_stack_overflow() File ".../pypy/interpreter/gateway.py", line 622, in funcrun_obj w_result = activation._run(space, scope_w) File "<6076-codegen .../pypy/tool/sourcetools.py:174>", line 3, in _run File ".../pypy/module/imp/importing.py", line 271, in importhook fromlist_w, tentative=True) File ".../pypy/module/imp/importing.py", line 297, in absolute_import fromlist_w, tentative) File ".../pypy/module/imp/importing.py", line 306, in absolute_import_with_lock fromlist_w, tentative) File ".../pypy/module/imp/importing.py", line 367, in _absolute_import tentative=tentative) File ".../pypy/module/imp/importing.py", line 646, in load_part w_mod = load_module(space, w_modulename, find_info) File ".../pypy/module/imp/importing.py", line 601, in load_module find_info.stream.readall()) File ".../pypy/module/imp/importing.py", line 911, in load_source_module 
exec_code_module(space, w_mod, code_w) File ".../pypy/module/imp/importing.py", line 872, in exec_code_module code_w.exec_code(space, w_dict, w_dict) File ".../pypy/interpreter/eval.py", line 33, in exec_code return frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1029, in CALL_FUNCTION self.call_function(oparg) File ".../pypy/interpreter/pyopcode.py", line 1012, in call_function w_result = self.space.call_args(w_function, args) File ".../pypy/objspace/descroperation.py", line 158, in call_args return w_obj.call_args(args) File ".../pypy/interpreter/function.py", line 59, in call_args return self.getcode().funcrun(self, args) File ".../pypy/interpreter/pycode.py", line 210, in funcrun return frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1022, in CALL_FUNCTION w_result = self.space.call_valuestack(w_function, nargs, self) File ".../pypy/interpreter/baseobjspace.py", line 1015, in call_valuestack return w_func.funccall_valuestack(nargs, frame) File ".../pypy/interpreter/function.py", line 147, in funccall_valuestack return self._flat_pycall(code, nargs, frame) File ".../pypy/interpreter/function.py", line 172, in _flat_pycall return new_frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1022, in CALL_FUNCTION w_result = self.space.call_valuestack(w_function, nargs, self) File ".../pypy/interpreter/baseobjspace.py", line 1015, in call_valuestack return w_func.funccall_valuestack(nargs, frame) File ".../pypy/interpreter/function.py", line 147, in funccall_valuestack return self._flat_pycall(code, nargs, frame) File ".../pypy/interpreter/function.py", line 172, in _flat_pycall return new_frame.run() File 
".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1022, in CALL_FUNCTION w_result = self.space.call_valuestack(w_function, nargs, self) File ".../pypy/interpreter/baseobjspace.py", line 1015, in call_valuestack return w_func.funccall_valuestack(nargs, frame) File ".../pypy/interpreter/function.py", line 147, in funccall_valuestack return self._flat_pycall(code, nargs, frame) File ".../pypy/interpreter/function.py", line 172, in _flat_pycall return new_frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1022, in CALL_FUNCTION w_result = self.space.call_valuestack(w_function, nargs, self) File ".../pypy/interpreter/baseobjspace.py", line 1015, in call_valuestack return w_func.funccall_valuestack(nargs, frame) File ".../pypy/interpreter/function.py", line 147, in funccall_valuestack return self._flat_pycall(code, nargs, frame) File ".../pypy/interpreter/function.py", line 172, in _flat_pycall return new_frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1022, in CALL_FUNCTION w_result = self.space.call_valuestack(w_function, nargs, self) File ".../pypy/interpreter/baseobjspace.py", line 1015, in call_valuestack return w_func.funccall_valuestack(nargs, frame) File ".../pypy/interpreter/function.py", line 147, in funccall_valuestack return self._flat_pycall(code, nargs, frame) File ".../pypy/interpreter/function.py", line 172, in _flat_pycall return new_frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, 
ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1029, in CALL_FUNCTION self.call_function(oparg) File ".../pypy/interpreter/pyopcode.py", line 1012, in call_function w_result = self.space.call_args(w_function, args) File ".../pypy/objspace/descroperation.py", line 160, in call_args return w_obj.call_args(args) File ".../pypy/interpreter/function.py", line 471, in call_args return space.call_obj_args(self.w_function, self.w_instance, args) File ".../pypy/interpreter/baseobjspace.py", line 961, in call_obj_args return w_callable.call_obj_args(w_obj, args) File ".../pypy/interpreter/function.py", line 63, in call_obj_args return self.getcode().funcrun_obj(self, w_obj, args) File ".../pypy/interpreter/pycode.py", line 223, in funcrun_obj return frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1022, in CALL_FUNCTION w_result = self.space.call_valuestack(w_function, nargs, self) File ".../pypy/interpreter/baseobjspace.py", line 1015, in call_valuestack return w_func.funccall_valuestack(nargs, frame) File ".../pypy/interpreter/function.py", line 147, in funccall_valuestack return self._flat_pycall(code, nargs, frame) File ".../pypy/interpreter/function.py", line 172, in _flat_pycall return new_frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1022, in CALL_FUNCTION w_result = self.space.call_valuestack(w_function, nargs, self) File ".../pypy/interpreter/baseobjspace.py", line 1015, in call_valuestack return w_func.funccall_valuestack(nargs, frame) File ".../pypy/interpreter/function.py", line 147, in funccall_valuestack return self._flat_pycall(code, nargs, frame) File ".../pypy/interpreter/function.py", line 172, in _flat_pycall return new_frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, 
next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1029, in CALL_FUNCTION self.call_function(oparg) File ".../pypy/interpreter/pyopcode.py", line 1012, in call_function w_result = self.space.call_args(w_function, args) File ".../pypy/objspace/descroperation.py", line 158, in call_args return w_obj.call_args(args) File ".../pypy/interpreter/function.py", line 59, in call_args return self.getcode().funcrun(self, args) File ".../pypy/interpreter/pycode.py", line 210, in funcrun return frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1029, in CALL_FUNCTION self.call_function(oparg) File ".../pypy/interpreter/pyopcode.py", line 1012, in call_function w_result = self.space.call_args(w_function, args) File ".../pypy/objspace/descroperation.py", line 158, in call_args return w_obj.call_args(args) File ".../pypy/interpreter/function.py", line 59, in call_args return self.getcode().funcrun(self, args) File ".../pypy/interpreter/pycode.py", line 210, in funcrun return frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1022, in CALL_FUNCTION w_result = self.space.call_valuestack(w_function, nargs, self) File ".../pypy/interpreter/baseobjspace.py", line 1015, in call_valuestack return w_func.funccall_valuestack(nargs, frame) File ".../pypy/interpreter/function.py", line 128, in funccall_valuestack return code.fastcall_0(self.space, self) File ".../pypy/interpreter/gateway.py", line 705, in fastcall_0 self.handle_exception(space, e) File ".../pypy/interpreter/gateway.py", line 648, in handle_exception rstackovf.check_stack_overflow() File ".../pypy/interpreter/gateway.py", line 700, in fastcall_0 w_result = self.fastfunc_0(space) File ".../pypy/module/posix/interp_posix.py", line 706, in fork run_fork_hooks('child', space) File ".../pypy/module/posix/interp_posix.py", line 690, in run_fork_hooks hook(space) File ".../pypy/rpython/lltypesystem/rffi.py", line 241, in wrapper res = call_external_function(*real_args) File 
"<6063-codegen .../pypy/rpython/lltypesystem/rffi.py:168>", line 6, in call_external_function File ".../pypy/rpython/lltypesystem/lltype.py", line 1283, in __call__ return callb(*args) File ".../pypy/rpython/lltypesystem/ll2ctypes.py", line 1164, in __call__ cfunc = get_ctypes_callable(self.funcptr, self.calling_conv) File ".../pypy/rpython/lltypesystem/ll2ctypes.py", line 1138, in get_ctypes_callable funcname, place)) NotImplementedError: function 'PyThread_ReInitTLS' not found in library '/tmp/usession-default-21/module_cache/pypyapi.so' """ I have no clue why it thinks it needs to call that function. Any idea? Stefan

2012/7/3 Stefan Behnel <stefan_ml@behnel.de>
No optimization of any kind has been done in cpyext (it's difficult enough to get it right...). A freelist would be a nice thing, but there would still be the cost of attaching the PyObject to the pypy w_object. Maybe we should use a weak dictionary to cache the PyObject structure. This already exists for objects defined and created from C...
Ah, but this won't work! py.py runs on top of CPython, so the PyString_AsString symbol is already defined by your CPython interpreter! There is a workaround, though: compile your extension module with

python2.7 pypy/module/cpyext/presetup.py setup.py build_ext -i

presetup.py will patch distutils and create a module "nbody.pypy-19i.so" (note the i) which works on top of an *interpreted* pypy. Among the hacks, all symbols are renamed: #define PyString_AsString PyPyString_AsString. Then this should work:

pypy/bin/py.py --withmod-cpyext -c "import nbody"

*very* slowly of course, but I was able to debug pygame this way! -- Amaury Forgeot d'Arc

Amaury Forgeot d'Arc, 03.07.2012 18:26:
I'm sure it is.
A freelist would be a nice thing, but there would still be the cost of attaching the PyObject to the pypy w_object.
Any reduction in the cost of passing and cleaning up objects would dramatically improve the overall performance of the interface.
Maybe we should use a weak dictionary to cache the PyObject structure. This already exists for objects defined and created from C...
That would be really helpful. In particular, it would solve one of the most annoying problems that extensions currently have: even if you keep a Python reference to an object, e.g. in a list, its PyObject structure will die once the last C reference to it is gone. That is really hard to work around in some cases. It's very common to keep e.g. a Python list (or set) of byte strings and pass their char* buffer pointers into a C library. That doesn't currently work with cpyext.
Right. I keep forgetting that. This inherent indirection in PyPy makes things seriously complicated. (And no, that's not a good thing.) It would be helpful if it printed an error message giving a hint of why it failed, instead of just stating that it failed to load the extension (I can see that it failed, dammit!).
Is there a reason why an interpreted PyPy cannot always do this? I mean, it can't work without this, can it?
Ok, that did the trick. This should be in a tutorial somewhere. Or maybe I should just dump it into a blog post.
*very* slowly of course, but I was able to debug pygames this way!
The problem is not so much that it's generally slow, but that the performance characteristics of the Python code are likely way different from those of the translated C code. That's certainly the case for Cython code: running cProfile over the Python code, running it over the compiled module, and running callgrind over it often yield totally different results. That's why I would prefer running this through callgrind instead of Python+profile (I noticed that cProfile doesn't work either). Is there a way to get readable debugging symbols in a translated PyPy that would tell me what is being executed? Stefan

Stefan Behnel, 03.07.2012 19:56:
Actually, it did work. I just had to enable the _lsprof module. However, it now prints a trace of every C-API function that it calls, e.g.

"""
<function PyTuple_CheckExact at 0x42da5a0> DONE
<function PyList_CheckExact at 0x43cc5a0> DONE
<function PySequence_Size at 0x43e1300> DONE
<function PySequence_ITEM at 0x43ef300> DONE
<function PySequence_ITEM at 0x43ef300> DONE
<function PySequence_ITEM at 0x43ef300> DONE
<function Py_DecRef at 0x40aea38>
<function subtype_dealloc at 0x4322990>
<function PyObject_dealloc at 0x43466f0>
<function PyObject_Del at 0x43464f8> DONE
DONE
DONE
DONE
"""

Is there a way to disable that? That level of verbosity could be a bit costly.

Stefan

Stefan Behnel, 03.07.2012 19:56:
Ok, I finally managed to get a profile and it turned out to be completely useless. I tried profiling inside of PyPy, but that doesn't tell me anything about cpyext. I then tried profiling the interpreted PyPy itself from CPython, but that only gives me a profile of the interpreted code, which is clearly different from what the compiled code does. It spends lots of time doing calls, for example, and the top time consuming function is "getattr", followed by "build_ctypes_array". At least "api.py:wrapper" is also somewhat up on the list, but it's just drowning in the noise. Back to that question then:
Is there a way to get readable debugging symbols in a translated PyPy that would tell me what is being executed?
Stefan

On Thu, Jul 5, 2012 at 10:26 AM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
The default build (not the distribution or nightly; you have to translate yourself) contains debug info. You can do "make lldebug" to get even more, but even without that it's usable with valgrind. Cheers, fijal

Maciej Fijalkowski, 05.07.2012 11:01:
Ah, yes. Given a suitably large machine and enough time to do other stuff, that did the trick for me. Here's the result: http://cython.org/callgrind-pypy-nbody.png As you can see, make_ref() and Py_DecRef() combined account for almost 80% of the runtime, so the expected gain from optimising the PyObject handling is *huge*. The first thing that pops up from the graph is the different calls through generic_cpy_call(). That looks way too generic for something as performance critical as Py_DecRef(). Also, what's that recursive "stackwalk()" thing doing? Stefan

On Thu, Aug 23, 2012 at 11:11 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I took a look at it and it seems it's not what I thought it was. It's just an intermediate call that saves stack roots and then calls the actual cpyext code. I don't think the call itself is harmful; it just happens to be on the call stack (always).

Maciej Fijalkowski, 23.08.2012 11:17:
Ah, ok - good to know. Then I think our next best bet is to cache the PyObject structs for PyPy objects using a weak-key dict. That will fix two problems at the same time: 1) it prunes excessive create-decref-dealloc cycles, and 2) it keeps the PyObject pointer valid as long as the PyPy object is alive, thus preventing crashes for code that expects an object reference in a list (for example) to be enough to keep the C object representation alive. Stefan
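To illustrate the weak-key caching idea in plain Python (all names are made up; this is not actual cpyext code): each wrapped object gets at most one PyObject stub, and the stub is reused instead of being re-created on every make_ref().

"""
import weakref

class PyObjectStub(object):
    def __init__(self, w_obj):
        self.w_obj_ref = weakref.ref(w_obj)  # weak, so the cache entry can still die
        self.refcnt = 0

_pyobj_cache = weakref.WeakKeyDictionary()   # w_obj -> PyObjectStub

def make_ref(w_obj):
    try:
        stub = _pyobj_cache[w_obj]           # reuse the existing stub
    except KeyError:
        stub = _pyobj_cache[w_obj] = PyObjectStub(w_obj)
    stub.refcnt += 1
    return stub

def decref(stub):
    stub.refcnt -= 1
    # no deallocation here: the stub only disappears (together with its
    # cache entry) once the wrapped object itself is garbage collected
"""

This addresses both points above: repeated make_ref()/decref() pairs no longer allocate anything, and the stub stays valid for as long as the wrapped object is alive.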

On Thu, Jul 5, 2012 at 2:35 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Note that this is RPython, not C. This is not a "generic_call" - this is a specialized version of generic call based on some arguments. That means that whatever could have been determined by those attributes has been constant-folded away. In RPython this is very typical - you write a generic version that specializes based on some values to arrive at a specific version. Cheers, fijal

Stefan Behnel, 05.07.2012 14:35:
I've set up a job on our build server to run a couple of (simple) benchmarks comparing Cython's current performance under CPython and PyPy. Note that these are C-API intensive benchmarks by design (they are compiled from Python code), even though they use static type annotations for optimisation. Whatever you might think about these benchmarks in general, I think they are quite suitable for cpyext.

https://sage.math.washington.edu:8091/hudson/job/cython-devel-cybenchmarks-p...

The latest results are here:

https://sage.math.washington.edu:8091/hudson/job/cython-devel-cybenchmarks-p...

They currently run 100-200x slower through cpyext than in CPython 2.7. The build job always uses the latest nightly build of PyPy, so any changes in Cython or PyPy will usually show up within the next 24 hours. The build job also archives the generated .c files (and the original sources, including the HTML version). If anyone wants to play with them, you can just download the C file, build it with distutils, import it and then call its "main(number_of_iterations)" function. The C code works in both PyPy and CPython, although the actual C-API calls differ somewhat between the two. Stefan
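For reference, the "build it with distutils" step can be as small as the following sketch (the module and file names here are made up; substitute whichever .c file you downloaded from the build job):

"""
# setup.py - minimal distutils build for one downloaded benchmark file
from distutils.core import setup
from distutils.extension import Extension

setup(
    name="cybench",
    ext_modules=[Extension("nbody_bench", sources=["nbody_bench.c"])],
)
"""

After "python setup.py build_ext -i" (or the pypy equivalent), "import nbody_bench; nbody_bench.main(10)" runs ten iterations.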

Thanks! I might think badly of those benchmarks as representing Python workloads, however they are very likely good for cpyext. Good job. On Thursday, July 5, 2012, Stefan Behnel <stefan_ml@behnel.de> wrote:
https://sage.math.washington.edu:8091/hudson/job/cython-devel-cybenchmarks-p...
The latest results are here:
https://sage.math.washington.edu:8091/hudson/job/cython-devel-cybenchmarks-p...

And now for something completely different: wouldn't it make dramatically more sense to just finish the cppyy branch and get an extension-making scheme that actually works? I have a project now that uses C++ extensions and Python quite extensively, and the only thing stopping me from migrating to cppyy from SWIG is the fact that cppyy is still so unfinished and requires a rebuild of PyPy with the ROOT libraries and all the other tedious things that slow down deployment. BR, Alex

Hi Alex,
cppyy is still so unfinished
Could you provide a prioritized list of what's still missing for you? I'm following a more or less random walk otherwise, with most work going into the CINT backend at the moment.
and requires rebuild of pypy with root libraries and all the other tedious things, that slow down deployment.
Just Reflex w/o ROOT should work now, but rebuilding pypy-c is going to remain needed for a while, I'm afraid. The only idea that I have for another method that allows deferring to runtime is to use ctypes instead of rffi for the C-API. Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

Hi Maciej,
It's in the process. Look at the ffi-backend branch.
looking good! Any chance of getting a W_CTypeFunc.call() taking args instead of args_w? I'd like to pull in the function pointer from app-level, so that loading in the .so's is deferred to run-time, but then do the call from interpreter level from an unwrapped function pointer with interpreter-level args, so the args need no re-wrapping. As an aside, and I know this is hard to do portably (Windows ...), but could load_library() take dl flags for dlopen? Thanks, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

2012/7/10 Eleytherios Stamatogiannakis <estama@gmail.com>:
A callback called from C cannot be jitted, unless it has a loop in Python code. The loop in the sqlite library does not count. The JIT needs a complete understanding of all operations in the loop... -- Amaury Forgeot d'Arc

Hi all, On Tue, 10 Jul 2012, Maciej Fijalkowski wrote:
if the JITted function still requires wrapped arguments, then an additional C function needs to be generated to be passed as the actual pointer, to receive the arguments and wrap them. Generating such a C function by the backend then requires a C-API for the calls to do the wrapping. That PyPy C-API is a long awaited project. :) As an aside, would it be sufficient to compile the callback (rather than JIT it)? I.e. to get code that doesn't have guards or the ability to fallback? The input args are coming from C, so the types are guaranteed, and if the code is purely numeric, with no fallback to the interpreter anywhere, then acquiring the GIL shouldn't be needed. Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

Hi Amaury, On Tue, 10 Jul 2012, Amaury Forgeot d'Arc wrote:
I'm trying to resist the urge to say that it should be the Python C-API, with all PyObject*'s replaced by PyPyObject*'s. :) For callbacks, something simpler could do. For example, the PyPyObject*'s passed in (i.e. the wrapped callback arguments) are going to be fully owned by the callback function. That way, the JIT is not blind to what happens to the objects, which should lead to better traces. Myself, I have two needs for callbacks: GUIs and fitting. For the former, performance is a non-issue. These callbacks are things like "button clicked", and it's thread safety and clean error handling that are important. As an API, something like PyPyRun_SimpleString or PyPyRun_AnyFile will already do fine. The latter, which is closer to what Eleytherios wants, I think, could be self-contained. More likely, (our) users will want to implement a __call__ member function and pass the callable instance. The idea is to have access to instance-specific data. Since the instance must live somewhere before being sent down the wire, its data are basically global. In terms of API, to generate the C stub, a simple PyPyObject_Call() would be enough (the tracking of the relevant PyPyObject* would be done in the bindings layer at the interpreter level; this is how it currently works in PyROOT/CPython as well). Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

2012/7/10 <wlavrijsen@lbl.gov>:
I'm trying to resist the urge to say that it should be the Python C-API, with all PyObject*'s replaced by PyPyObject*'s. :)
I don't think it's a good idea to use pointers with a moving GC. But I may be wrong. In any case, we need a memory model compatible with C. -- Amaury Forgeot d'Arc

wlavrijsen@lbl.gov, 10.07.2012 19:38:
There's no reason it would require wrapped arguments. The input types are known from the static low-level function type, so the JIT compiler can just work with them and adapt the function that is being called. These things are trickier in a non-JIT environment, but we are currently working on a general framework to support low-level calls through the CPython ecosystem (I mentioned that before). Stefan

Hi Stefan,
There's no reason it would require wrapped arguments.
that completely depends on how it is generated, of course, and in the context of calls within Python (pypy-c), it makes sense to have the entry point of the function expect wrapped arguments, and have the exit point re-wrap.
Yes, except that the above only follows after this:
I.e., there should first be a good method of delivering the low-level info. Right now, that delivery method is the act of unwrapping itself (that is, the wrapped types carry the low-level info). Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

wlavrijsen@lbl.gov, 10.07.2012 20:10:
There's no reason you can't have multiple entry points to the same code. Cython has been doing this for a while, and I'm sure PyPy's JIT does it, too.
Sure, that's why we've been working on a specification for a protocol here. Basically, the caller and the callee would have to agree on a common signature at runtime out of the given N:M set of choices. A JIT compiler would obviously prefer generating a suitable signature, given the set of signatures that the other side presents.
Right now, that delivery method is the act of unwrapping itself (that is, the wrapped types carry the low-level info).
I have no idea what you mean, but you make it sound backwards for the case of a callback. Stefan

Hi Stefan, On Tue, 10 Jul 2012, Stefan Behnel wrote:
I'm not sure whether the PyPy JIT does that, as an entry point somewhere in the middle of a compiled trace would bypass guards, and, as said, re-wrapping at an exit point is needed (not to mention if a guard fails half-way through the trace and the code drops into the black hole). But a JIT expert should provide the details of what's possible. :)
See Anto's talk at EuroPython for a better explanation of what I'm saying (or trying to say anyway): https://ep2012.europython.eu/conference/talks/pypy-jit-under-the-hood The type information going into the traces comes from the type information that the interpreter has, i.e. from the wrapped types. This isn't backwards AFAICS, since a general Python function (callback or otherwise) can receive any type (and of course, is allowed to fail by raising an exception if it can't deal with the types). It's only once the trace has been generated that the types are fixed for the trace (with a fallback option, of course). Now, retrofitting the callback mechanism on top of this may very well be backwards, which is why I think we all do agree that a better handshake is needed. And if somebody could pretty please code that up for PyPy, so that I can use it. :) As-is, I could even live with CFFI funcptrs taking a wrapped tuple of args. After all, wrapping is easy and fast at the interpreter level. However, the JIT will be blind to it. Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

wlavrijsen@lbl.gov, 10.07.2012 20:37:
Ok, then in the case of a callback, the runtime could simply start with totally stupid code that packs the low-level arguments into Python arguments, attaches the low-level type information to them and then passes them into the function. The JIT should then see in its trace what types originally came in and optimise the argument packing away, shouldn't it? Stefan
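A toy sketch of such a "stupid" entry point in plain Python (made-up names, nothing PyPy-specific): each raw argument is boxed together with its low-level type before the user's callback is called, which is exactly the kind of code a tracing JIT could later specialise away.

"""
class Boxed(object):
    def __init__(self, c_type, value):
        self.c_type = c_type        # low-level type information kept with the value
        self.value = value

def make_entry_point(py_callback, arg_types):
    def entry_point(*raw_args):
        boxed = [Boxed(t, raw) for t, raw in zip(arg_types, raw_args)]
        return py_callback(*boxed)  # the callback only ever sees Python objects
    return entry_point

# hypothetical usage:
#   cb = make_entry_point(user_function, ["long", "double"])
#   cb(42, 3.14)
"""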

Hi Stefan,
no, don't think so: the JIT works by seeing code execute (the "tracing" part of it), so with the above recipe, there's still a chicken-and-egg problem. That is, in order to JIT, the code in the callback needs to be executed, but to execute, the trace entry point is needed, but for that, the JIT needs to run ... circle. :/ I was more thinking of an automated equivalent of the toolchain: http://doc.pypy.org/en/latest/getting-started-dev.html#id13 with annotations provided programmatically. But if that worked cleanly on pypy-c, I'd have expected separable "builtin" modules already. :) Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

wlavrijsen@lbl.gov, 10.07.2012 21:02:
Depends. The initial (stupid) entry point would be generated in the moment where you assign a Python function to a C function pointer. You'd basically pass a pointer to a wrapper function that calls your Python function. The question is if a conversion from low-level types to Python types can be seen and handled by the JIT, but I'd be surprised if it couldn't, since it generates this kind of code itself anyway. It's the same as calling a previously unoptimised (Python) branch from an optimised (low-level) one. Stefan

Hi Stefan,
this is beyond my expertise, but one way of making the wrapping visible, is to generate the C stub with types, transform those in a memory block with the needed annotations, then push that block and annotations into PyPy, and do the wrapping and python call in RPython. This also by-passes the need for a C-API (other than that one API to push the memory + annotations). Sounds like something I could do today, but still feels like a square peg / round hole kind of solution, since the code to trigger that C stub generation would (in cppyy's case) start in RPython to begin with. Makes me wish I could generate RPython (after translation, that is). Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

Stefan Behnel, 03.07.2012 19:56:
Ok, so where would this have to be done? Is there a way to implement it generically in that ubiquitous space.wrap() kind of call (whatever that does internally), or would it have to be done explicitly whenever objects pass the boundary? Stefan

Hi Stefan, On Wed, Aug 29, 2012 at 10:29 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
You need to do it at the level of the boundary --- space.wrap() is at a different level. We can't slow it down by doing dictionary lookups there, because it would be a big performance hit for the whole of PyPy. FWIW, I'm now thinking about a possible rewrite of cpyext, based on (an extension of) CFFI, that would let us declare a C API and implement it in Python. This would mean basically moving cpyext to a regular app-level module. Doing so in general would be useful to write C extensions or wrappers for existing libraries: they would not necessarily be using the CPython C API but instead any custom C API that is most suitable for that particular extension. For Cython, it would mean that you would be free to invent whatever API is most suitable --- starting, I'm sure, with some subset of the CPython C API, but possibly evolving over time towards something that offers better performance in the context of PyPy. A bientôt, Armin.

Hi Armin, Armin Rigo, 30.08.2012 09:10:
Understood. This may not even be that hard to do, but it's unlikely that I'll give this a try myself. The turn-over times of building PyPy for an hour or so on a remote machine (RAM!) for most non-trivial changes are just way too long to make friends with the code base. Even waiting for minutes to get an interpreted PyPy in a state where it can start building (let alone running) Cython extensions is way beyond the seconds in my usual work flow.
Assuming that's doable in a reasonable time frame (including proper JITting of the API implementation to make it reasonably fast), that sounds like a good idea. Cython already uses quite a bit of special casing for cpyext by now, so replacing at least some of the more involved code blocks with a more direct C-API call would be nice. Actually the subset of the CPython C-API that Cython uses is not even all that large. Instead, we inline and special case a lot of cases that CPython needs to broaden its API for and tend to use the most low-level functions only. So Cython should prove a suitable target for such an API implementation. Funny enough, I had to change many low-level calls in CPython (PyTuple_*) to high-level calls in cpyext (PySequence_*), both because they are substantially faster (less useless error testing) and less likely to crash (no borrowed reference handling). It also sounds like this approach would actually enable a workable test work-flow for people like me. Stefan

Hi Stefan, On Thu, Aug 30, 2012 at 10:06 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
This is a blocker, and one that the plan towards CFFI should completely fix. It's going to be so much more fun to write regular Python code and try it immediately :-) rather than hack at the interpreter and have to wait minutes for tests to run or an hour for a full pypy build. A bientôt, Armin.

Hi, On Thu, Aug 30, 2012 at 10:28 AM, Armin Rigo <arigo@tunes.org> wrote:
This is a blocker, and one that the plan towards CFFI should completely fix.
A quick but successful attempt can be found in cffi/demo/{api,pyobj}.py. api.py is a bit of a hack but lets you declare functions with the decorator @ffi.pyexport("C function signature"). This gives you Python-implemented functions that you can call back from C. They are available from the C code passed into verify(). (A possible extension of this idea would be to choose the precise name and C-level interface that we want to give to the .so; with only a few extra hacks this could let the user write pure Python CFFI code to implement a .so that can be used in the system without knowledge that it is written in Python in the first place, e.g. as a plug-in to some existing program.) pyobj.py is a quick attempt at exposing the C API I described earlier, with "object descriptors". So far the API is only sufficient for the demo in this file, which is a C function that iterates over a list and computes the sum of its items. In the first version it assumes all items are numbers that fit in an "int". In the second version it really calls the Python addition, and so supports various kinds of objects. Note how the whole API is basically defined and implemented as regular Python code. These examples should work well on both CPython and PyPy. A bientôt, Armin.
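As a rough analogue of what pyobj.py does (this is not the demo code itself; it only uses standard cffi calls - cdef/verify/callback/new_handle - and the names sum_list/getitem_t are made up), a C function can walk a Python list it knows nothing about by calling back into Python for every item:

"""
import cffi

ffi = cffi.FFI()
ffi.cdef('''
    typedef long (*getitem_t)(void *obj, long index);
    long sum_list(void *obj, long length, getitem_t getitem);
''')
lib = ffi.verify('''
    typedef long (*getitem_t)(void *obj, long index);
    long sum_list(void *obj, long length, getitem_t getitem) {
        long total = 0, i;
        for (i = 0; i < length; i++)
            total += getitem(obj, i);
        return total;
    }
''')

items = [1, 2, 3, 4]
handle = ffi.new_handle(items)          # opaque void* standing in for the list

@ffi.callback("long(void *, long)")
def getitem(obj, index):
    return ffi.from_handle(obj)[index]  # back into Python for each item

print(lib.sum_list(handle, len(items), getitem))   # prints 10
"""

Every item access crosses the C/Python boundary once, which is exactly the spot where such an object-descriptor API gets exercised.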

Maciej Fijalkowski, 04.07.2012 10:48:
Done. https://bitbucket.org/pypy/pypy/wiki/Speeding%20up%20cpyext Stefan

next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1029, in CALL_FUNCTION self.call_function(oparg) File ".../pypy/interpreter/pyopcode.py", line 1012, in call_function w_result = self.space.call_args(w_function, args) File ".../pypy/objspace/descroperation.py", line 158, in call_args return w_obj.call_args(args) File ".../pypy/interpreter/function.py", line 59, in call_args return self.getcode().funcrun(self, args) File ".../pypy/interpreter/pycode.py", line 210, in funcrun return frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1029, in CALL_FUNCTION self.call_function(oparg) File ".../pypy/interpreter/pyopcode.py", line 1012, in call_function w_result = self.space.call_args(w_function, args) File ".../pypy/objspace/descroperation.py", line 158, in call_args return w_obj.call_args(args) File ".../pypy/interpreter/function.py", line 59, in call_args return self.getcode().funcrun(self, args) File ".../pypy/interpreter/pycode.py", line 210, in funcrun return frame.run() File ".../pypy/interpreter/pyframe.py", line 141, in run return self.execute_frame() File ".../pypy/interpreter/pyframe.py", line 175, in execute_frame executioncontext) File ".../pypy/interpreter/pyopcode.py", line 84, in dispatch next_instr = self.handle_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 107, in handle_bytecode rstackovf.check_stack_overflow() File ".../pypy/interpreter/pyopcode.py", line 90, in handle_bytecode next_instr = self.dispatch_bytecode(co_code, next_instr, ec) File ".../pypy/interpreter/pyopcode.py", line 265, in dispatch_bytecode res = meth(oparg, next_instr) File ".../pypy/interpreter/pyopcode.py", line 1022, in CALL_FUNCTION w_result = self.space.call_valuestack(w_function, nargs, self) File ".../pypy/interpreter/baseobjspace.py", line 1015, in call_valuestack return w_func.funccall_valuestack(nargs, frame) File ".../pypy/interpreter/function.py", line 128, in funccall_valuestack return code.fastcall_0(self.space, self) File ".../pypy/interpreter/gateway.py", line 705, in fastcall_0 self.handle_exception(space, e) File ".../pypy/interpreter/gateway.py", line 648, in handle_exception rstackovf.check_stack_overflow() File ".../pypy/interpreter/gateway.py", line 700, in fastcall_0 w_result = self.fastfunc_0(space) File ".../pypy/module/posix/interp_posix.py", line 706, in fork run_fork_hooks('child', space) File ".../pypy/module/posix/interp_posix.py", line 690, in run_fork_hooks hook(space) File ".../pypy/rpython/lltypesystem/rffi.py", line 241, in wrapper res = call_external_function(*real_args) File 
"<6063-codegen .../pypy/rpython/lltypesystem/rffi.py:168>", line 6, in call_external_function File ".../pypy/rpython/lltypesystem/lltype.py", line 1283, in __call__ return callb(*args) File ".../pypy/rpython/lltypesystem/ll2ctypes.py", line 1164, in __call__ cfunc = get_ctypes_callable(self.funcptr, self.calling_conv) File ".../pypy/rpython/lltypesystem/ll2ctypes.py", line 1138, in get_ctypes_callable funcname, place)) NotImplementedError: function 'PyThread_ReInitTLS' not found in library '/tmp/usession-default-21/module_cache/pypyapi.so' """ I have no clue why it thinks it needs to call that function. Any idea? Stefan

2012/7/3 Stefan Behnel <stefan_ml@behnel.de>
No optimization of any kind has been done in cpyext (it's difficult enough to get it right...). A freelist would be a nice thing, but there would still be the cost of attaching the PyObject to the PyPy w_object. Maybe we should use a weak dictionary to cache the PyObject structure. This already exists for objects defined and created from C...
Ah, but this won't work! py.py runs on top of CPython, so the PyString_AsString symbol is already defined by your CPython interpreter! There is a workaround, though: compile your extension module with

  python2.7 pypy/module/cpyext/presetup.py setup.py build_ext -i

presetup.py will patch distutils and create a module "nbody.pypy-19i.so" (note the i) which works on top of an *interpreted* pypy. Among the hacks, all symbols are renamed: #define PyString_AsString PyPyString_AsString. Then this should work:

  pypy/bin/py.py --withmod-cpyext -c "import nbody"

*very* slowly of course, but I was able to debug pygames this way! -- Amaury Forgeot d'Arc

Amaury Forgeot d'Arc, 03.07.2012 18:26:
I'm sure it is.
A freelist would be a nice thing, but there would still be the cost of attaching the PyObject to the pypy w_object.
Any reduction in the cost of passing and cleaning up objects would dramatically improve the overall performance of the interface.
Maybe we should use a weak dictionary to cache the PyObject structure. This already exists for objects defined and created from C...
That would be really helpful. In particular, it would solve one of the most annoying problems that extensions currently have: even if you keep a Python reference to an object, e.g. in a list, its PyObject structure will die once the last C reference to it is gone. That is really hard to work around in some cases. It's very common to keep e.g. a Python list (or set) of byte strings and pass their char* buffer pointers into a C library. That doesn't currently work with cpyext.
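[Editor's note: to make that failure mode concrete, here is a rough CPython 2 sketch of the pattern, using ctypes.pythonapi only so that the snippet is self-contained. The strings and names are made up, and the comments describe the cpyext behaviour discussed above, which this snippet cannot demonstrate on its own.]
"""
import ctypes

# Get the raw char* buffer of a Python 2 str object.  No new reference is
# taken for the buffer, so only the list below keeps the strings alive.
PyString_AsString = ctypes.pythonapi.PyString_AsString
PyString_AsString.restype = ctypes.c_void_p           # raw buffer address
PyString_AsString.argtypes = [ctypes.py_object]

strings = ["alpha", "beta", "gamma"]                   # Python-level references
pointers = [PyString_AsString(s) for s in strings]     # char* handed to C code

# On CPython the list reference is enough to keep every buffer valid.
# Under cpyext, the C-level PyObject struct can be deallocated once the last
# C reference is dropped, so the cached pointers may dangle even though the
# Python list still holds the objects.
"""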
Right. I keep forgetting that. This inherent indirection in PyPy makes things seriously complicated. (And no, that's not a good thing.) It would be helpful if it printed an error message giving a hint of why it failed, instead of just stating that it failed to load the extension (I can see that it failed, dammit!).
Is there a reason why an interpreted PyPy cannot always do this? I mean, it can't work without this, can it?
Ok, that did the trick. This should be in a tutorial somewhere. Or maybe I should just dump it into a blog post.
*very* slowly of course, but I was able to debug pygames this way!
The problem is not so much that it's generally slow, but that the performance characteristics of the Python code are likely way different from those of the translated C code. That's certainly the case for Cython code: running cProfile over the Python code, running it over the compiled module, and running callgrind over it often yield totally different results. That's why I would prefer running this through callgrind instead of Python+profile (I noticed that cProfile doesn't work either). Is there a way to get readable debugging symbols in a translated PyPy that would tell me what is being executed? Stefan

Stefan Behnel, 03.07.2012 19:56:
Actually, it did work. I just had to enable the _lsprof module. However, it now prints a trace of every C-API function that it calls, e.g.
"""
<function PyTuple_CheckExact at 0x42da5a0> DONE
<function PyList_CheckExact at 0x43cc5a0> DONE
<function PySequence_Size at 0x43e1300> DONE
<function PySequence_ITEM at 0x43ef300> DONE
<function PySequence_ITEM at 0x43ef300> DONE
<function PySequence_ITEM at 0x43ef300> DONE
<function Py_DecRef at 0x40aea38>
<function subtype_dealloc at 0x4322990>
<function PyObject_dealloc at 0x43466f0>
<function PyObject_Del at 0x43464f8> DONE DONE DONE DONE
"""
Is there a way to disable that? That level of verbosity could be a bit costly. Stefan

Stefan Behnel, 03.07.2012 19:56:
Ok, I finally managed to get a profile and it turned out to be completely useless. I tried profiling inside of PyPy, but that doesn't tell me anything about cpyext. I then tried profiling the interpreted PyPy itself from CPython, but that only gives me a profile of the interpreted code, which is clearly different from what the compiled code does. It spends lots of time doing calls, for example, and the top time consuming function is "getattr", followed by "build_ctypes_array". At least "api.py:wrapper" is also somewhat up on the list, but it's just drowning in the noise. Back to that question then:
Is there a way to get readable debugging symbols in a translated PyPy that would tell me what is being executed?
Stefan

On Thu, Jul 5, 2012 at 10:26 AM, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
The default build (not the distribution or nightly; you have to translate it yourself) contains debug info. You can do "make lldebug" to get even more, but even without it it's usable with valgrind. Cheers, fijal

Maciej Fijalkowski, 05.07.2012 11:01:
Ah, yes. Given a suitably large machine and enough time to do other stuff, that did the trick for me. Here's the result: http://cython.org/callgrind-pypy-nbody.png As you can see, make_ref() and Py_DecRef() combined account for almost 80% of the runtime. So the expected gain from optimising the PyObject handling is *huge*. The first thing that pops up from the graph is the different calls through generic_cpy_call(). That looks way too generic for something as performance-critical as Py_DecRef(). Also, what's that recursive "stackwalk()" thing doing? Stefan

On Thu, Aug 23, 2012 at 11:11 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
I took a look at it and it seems it's not what I thought it was. It's just an intermediate call that saves the stack roots and then calls into the actual cpyext code. I don't think the call itself is harmful; it just happens to be on the call stack (always).

Maciej Fijalkowski, 23.08.2012 11:17:
Ah, ok - good to know. Then I think our next best bet is to cache the PyObject structs for PyPy objects using a weak-key dict. That will fix two problems at the same time: 1) prune excessive create-decref-dealloc cycles 2) keep the PyObject pointer valid as long as the PyPy object is alive, thus preventing crashes for code that expects an object reference in a list (for example) to be enough to keep the C object representation alive. Stefan
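[Editor's note: for illustration, a minimal pure-Python sketch of that weak-key caching idea. The struct-allocation helper is a stand-in, not an actual cpyext function, and a real implementation would of course live at the interpreter level.]
"""
import weakref

def _allocate_pyobject_struct(w_obj):
    # Stand-in for whatever cpyext does to create the C-level PyObject;
    # returning a fake "address" keeps the sketch self-contained.
    return id(w_obj)

# Maps a PyPy-level object to its cached PyObject struct.  Entries disappear
# automatically when the PyPy object dies, and stay valid as long as it lives.
_pyobj_cache = weakref.WeakKeyDictionary()

def as_pyobj(w_obj):
    try:
        return _pyobj_cache[w_obj]
    except KeyError:
        py_obj = _allocate_pyobject_struct(w_obj)
        _pyobj_cache[w_obj] = py_obj
        return py_obj
"""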

On Thu, Jul 5, 2012 at 2:35 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
Note that this is RPython, not C. This is not a "generic_call" - this is a specialized version of the generic call, based on some arguments. That means that whatever could have been determined by those attributes has been constant-folded away. In RPython this is very typical - you write a generic version that specializes based on some values to arrive at a specific version. Cheers, fijal

Stefan Behnel, 05.07.2012 14:35:
I've set up a job on our build server to run a couple of (simple) benchmarks comparing Cython's current performance under CPython and PyPy. Note that these are C-API intensive benchmarks by design (they are compiled from Python code), even though they use static type annotations for optimisation. Despite what you might think about these benchmarks in general, I think they are quite suitable for cpyext. https://sage.math.washington.edu:8091/hudson/job/cython-devel-cybenchmarks-p... The latest results are here: https://sage.math.washington.edu:8091/hudson/job/cython-devel-cybenchmarks-p... They currently run 100-200x slower through cpyext than in CPython 2.7. The build job always uses the latest nightly build of PyPy, so any changes in Cython or PyPy will usually show up within the next 24 hours. The build job also archives the generated .c files (and the original sources, including the HTML version). If anyone wants to play with them, you can just download the C file, build it with distutils, import it and then call its "main(number_of_iterations)" function, as sketched below. The C code works in both PyPy and CPython, although the actual C-API calls differ somewhat between the two. Stefan
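[Editor's note: trying one of the archived benchmarks locally amounts to something like the following; the module name and iteration count are placeholders for whichever file you download.]
"""
# build the downloaded .c file first, e.g.: python setup.py build_ext --inplace
import bm_benchmark            # hypothetical name of the compiled module
bm_benchmark.main(100000)      # "main(number_of_iterations)" as described above
"""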

Thanks! I might think badly of those benchmarks as representative Python workloads, however they are very likely good for cpyext. Good job. On Thursday, July 5, 2012, Stefan Behnel <stefan_ml@behnel.de> wrote:
https://sage.math.washington.edu:8091/hudson/job/cython-devel-cybenchmarks-p...
The latest results are here:
https://sage.math.washington.edu:8091/hudson/job/cython-devel-cybenchmarks-p...

And now for something completely different: Wouldn't it make dramatically more sense to just finish the cppyy branch and get an extension-building scheme that actually works? I have a project now that uses C++ extensions and Python quite extensively, and the only thing stopping me from migrating from SWIG to cppyy is the fact that cppyy is still so unfinished and requires rebuilding PyPy with the ROOT libraries and all the other tedious things that slow down deployment. BR, Alex

Hi Alex,
cppyy is still so unfinished
Could you provide a prioritized list of what's still missing for you? I'm following a more or less random walk otherwise, with most work going into the CINT backend at the moment.
and requires rebuild of pypy with root libraries and all the other tedious things, that slow down deployment.
Just Reflex w/o ROOT should work now, but rebuilding pypy-c is going to remain needed for a while, I'm afraid. The only idea I have for another method that allows deferring this to runtime is to use ctypes instead of rffi for the C API. Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

Hi Maciej,
it's in the process. look on ffi-backend branch
looking good! Any chance of getting a W_CTypeFunc.call() taking args instead of args_w? I'd like to pull in the function pointer from app-level, so that loading in the .so's is deferred to run-time, but then do the call from interpreter level from an unwrapped function pointer with interpreter-level args, so the args need no re-wrapping. As an aside, and I know this is hard to do portably (Windows ...), but could load_library() take dl flags for dlopen? Thanks, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

2012/7/10 Eleytherios Stamatogiannakis <estama@gmail.com>:
A callback called from C cannot be jitted, unless it has a loop in Python code. The loop in the sqlite library does not count. The JIT needs a complete understanding of all operations in the loop... -- Amaury Forgeot d'Arc
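[Editor's note: as a concrete example of the kind of callback being discussed, using the standard library's sqlite3 module purely for illustration (not taken from the original mail): the loop over the rows lives in the C library, and the Python callback itself contains no loop that a tracing JIT could latch onto.]
"""
import sqlite3

def py_upper(s):
    # Called from C (inside sqlite's query loop) once per row; no Python-level
    # loop here, so there is nothing for a tracing JIT to trace.
    return s.upper()

con = sqlite3.connect(":memory:")
con.create_function("py_upper", 1, py_upper)
con.execute("create table t (name text)")
con.executemany("insert into t values (?)", [("spam",), ("eggs",)])
print(con.execute("select py_upper(name) from t").fetchall())
"""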

Hi all, On Tue, 10 Jul 2012, Maciej Fijalkowski wrote:
If the JITted function still requires wrapped arguments, then an additional C function needs to be generated to be passed as the actual pointer, to receive the arguments and wrap them. Generating such a C function in the backend then requires a C-API for the calls that do the wrapping. That PyPy C-API is a long-awaited project. :) As an aside, would it be sufficient to compile the callback (rather than JIT it)? I.e. to get code that doesn't have guards or the ability to fall back? The input args are coming from C, so the types are guaranteed, and if the code is purely numeric, with no fallback to the interpreter anywhere, then acquiring the GIL shouldn't be needed. Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

Hi Amaury, On Tue, 10 Jul 2012, Amaury Forgeot d'Arc wrote:
I'm trying to resist the urge to say that it should be the Python C-API, with all PyObject*'s replaced by PyPyObject*'s. :) For callbacks, something simpler could do. For example, the PyPyObject*'s passed in (i.e. the wrapped callback arguments) are going to be fully owned by the callback function. That way, the JIT is not blind to what happens to the objects, which should lead to better traces. Myself, I have two needs for callbacks: GUIs and fitting. For the former, performance is a non-issue. These callbacks are things like "button clicked", and it's thread safety and clean error handling that are important. As an API, something like PyPyRun_SimpleString or PyPyRun_AnyFile will already do fine. The latter, which is closer to what Eleytherios wants, I think, could be self-contained. More likely, (our) users will want to implement a __call__ member function and pass the callable instance. The idea is to have access to instance-specific data. Since the instance must live somewhere before being sent down the wire, its data are basically global. In terms of API, to generate the C stub, a simple PyPyObject_Call() would be enough (the tracking of the relevant PyPyObject* would be done in the bindings layer at the interpreter level; this is how it currently works in PyROOT/CPython as well). Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

2012/7/10 <wlavrijsen@lbl.gov>:
I'm trying to resist the urge to say that it should be the Python C-API, with all PyObject*'s replaced by PyPyObject*'s. :)
I don't think it's a good idea to use pointers with a moving GC. But I may be wrong. In any case, we need a memory model compatible with C. -- Amaury Forgeot d'Arc

wlavrijsen@lbl.gov, 10.07.2012 19:38:
There's no reason it would require wrapped arguments. The input types are known from the static low-level function type, so the JIT compiler can just work with them and adapt the function that is being called. These things are trickier in a non-JIT environment, but we are currently working on a general framework to support low-level calls through the CPython ecosystem (I mentioned that before). Stefan

Hi Stefan,
There's no reason it would require wrapped arguments.
that completely depends on how it is generated, of course, and in the context of calls within Python (pypy-c), it makes sense to have the entry point of the function expect wrapped arguments, and have the exit point re-wrap.
Yes, except that the above only follows after this:
I.e., there should first be a good method of delivering the low-level info. Right now, that delivery method is the act of unwrapping itself (that is, the wrapped types carry the low-level info). Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

wlavrijsen@lbl.gov, 10.07.2012 20:10:
There's no reason you can't have multiple entry points to the same code. Cython has been doing this for a while, and I'm sure PyPy's JIT does it, too.
Sure, that's why we've been working on a specification for a protocol here. Basically, the caller and the callee would have to agree on a common signature at runtime out of the given N:M set of choices. A JIT compiler would obviously prefer generating a suitable signature, given the set of signatures that the other side presents.
Right now, that delivery method is the act of unwrapping itself (that is, the wrapped types carry the low-level info).
I have no idea what you mean, but you make it sound backwards for the case of a callback. Stefan

Hi Stefan, On Tue, 10 Jul 2012, Stefan Behnel wrote:
I'm not sure whether the PyPy JIT does that, as an entry point somewhere in the middle of a compiled trace would bypass guards, and, as said, re-wrapping at an exit point is needed (not to mention what happens if a guard fails halfway through the trace and the code drops into the blackhole interpreter). But a JIT expert should provide the details of what's possible. :)
See Anto's talk at EuroPython for a better explanation of what I'm saying (or trying to say anyway): https://ep2012.europython.eu/conference/talks/pypy-jit-under-the-hood The type information going into the traces comes from the type information that the interpreter has, i.e. from the wrapped types. This isn't backwards AFAICS, since a general Python function (callback or otherwise) can receive any type (and of course, is allowed to fail by raising an exception if it can't deal with the types). It's only once the trace has been generated that the types are fixed for the trace (with a fallback option, of course). Now, retrofitting the callback mechanism on top of this may very well be backwards, which is why I think we all agree that a better handshake is needed. And if somebody could pretty please code that up for PyPy, so that I can use it. :) As-is, I could even live with CFFI funcptrs taking a wrapped tuple of args. After all, wrapping is easy and fast at the interpreter level. However, the JIT will be blind to it. Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

wlavrijsen@lbl.gov, 10.07.2012 20:37:
Ok, then in the case of a callback, the runtime could simply start with totally stupid code that packs the low-level arguments into Python arguments, attaches the low-level type information to them and then passes them into the function. The JIT should then see in its trace what types originally came in and optimise the argument packing away, shouldn't it? Stefan
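[Editor's note: for comparison, CFFI's existing callback support on the CPython side already performs roughly this boxing step. A small self-contained sketch follows; the double-adding signature is made up for illustration. The C-level trampoline converts the incoming low-level arguments into Python objects before calling the Python function.]
"""
import cffi

ffi = cffi.FFI()

# ffi.callback() hands back a C function pointer; when C code calls it, the
# raw double arguments are boxed into Python floats and passed to py_add().
@ffi.callback("double(double, double)")
def py_add(a, b):
    return a + b

# Calling through the C-level pointer from Python works too:
print(py_add(1.5, 2.5))   # 4.0
"""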

Hi Stefan,
no, don't think so: the JIT works by seeing code execute (the "tracing" part of it), so with the above recipe, there's still a chicken-and-egg problem. That is, in order to JIT, the code in the callback needs to be executed, but to execute, the trace entry point is needed, but for that, the JIT needs to run ... circle. :/ I was more thinking of an automated equivalent of the toolchain: http://doc.pypy.org/en/latest/getting-started-dev.html#id13 with annotations provided programmatically. But if that worked cleanly on pypy-c, I'd have expected separable "builtin" modules already. :) Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

wlavrijsen@lbl.gov, 10.07.2012 21:02:
Depends. The initial (stupid) entry point would be generated at the moment you assign a Python function to a C function pointer. You'd basically pass a pointer to a wrapper function that calls your Python function. The question is whether a conversion from low-level types to Python types can be seen and handled by the JIT, but I'd be surprised if it couldn't, since it generates this kind of code itself anyway. It's the same as calling a previously unoptimised (Python) branch from an optimised (low-level) one. Stefan

Hi Stefan,
this is beyond my expertise, but one way of making the wrapping visible is to generate the C stub with types, transform those into a memory block with the needed annotations, then push that block and the annotations into PyPy, and do the wrapping and the Python call in RPython. This also bypasses the need for a C-API (other than that one API to push the memory + annotations). Sounds like something I could do today, but it still feels like a square peg / round hole kind of solution, since the code to trigger that C stub generation would (in cppyy's case) start in RPython to begin with. Makes me wish I could generate RPython (after translation, that is). Best regards, Wim -- WLavrijsen@lbl.gov -- +1 (510) 486 6411 -- www.lavrijsen.net

Stefan Behnel, 03.07.2012 19:56:
Ok, so where would this have to be done? Is there a way to implement it generically in that ubiquitous space.wrap() kind of call (whatever that does internally), or would it have to be done explicitly whenever objects pass the boundary? Stefan

Hi Stefan, On Wed, Aug 29, 2012 at 10:29 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
You need to do it at the level of the boundary --- space.wrap() is at a different level. We can't slow it down by doing dictionary lookups there, because it would be a big performance hit for the whole of PyPy. Fwiw I'm now thinking about a possible rewrite of cpyext, based on (an extension of) CFFI, that would let us declare a C API and implement it in Python. This would mean basically moving cpyext to a regular app-level module. Doing so in general would be useful to write C extensions or wrappers for existing libraries: they would not necessarily be using the CPython C API but instead any custom C API that is most suitable for that particular extension. For Cython, it would mean that you would be free to invent whatever API is most suitable --- starting, I'm sure, with some subset of the CPython C API, but possibly evolving over time towards something that offers better performance in the context of PyPy. A bientôt, Armin.

Hi Armin, Armin Rigo, 30.08.2012 09:10:
Understood. This may not even be that hard to do, but it's unlikely that I'll give it a try myself. The turnaround times of building PyPy for an hour or so on a remote machine (RAM!) after most non-trivial changes are just way too long to make friends with the code base. Even waiting for minutes to get an interpreted PyPy into a state where it can start building (let alone running) Cython extensions is way beyond the seconds in my usual workflow.
Assuming that's doable in a reasonable time frame (including proper JITting of the API implementation to make it reasonably fast), that sounds like a good idea. Cython already uses quite a bit of special-casing for cpyext by now, so replacing at least some of the more involved code blocks with a more direct C-API call would be nice. Actually, the subset of the CPython C-API that Cython uses is not even all that large. Instead, we inline and special-case a lot of things for which CPython has to broaden its API, and tend to use only the most low-level functions. Funnily enough, I had to change many low-level calls in CPython (PyTuple_*) to high-level calls in cpyext (PySequence_*), both because they are substantially faster (less useless error testing) and less likely to crash (no borrowed reference handling). It also sounds like this approach would actually enable a workable test workflow for people like me. Stefan

Hi Stefan, On Thu, Aug 30, 2012 at 10:06 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
This is a blocker, and one that the plan towards CFFI should completely fix. It's going to be so much more fun to write regular Python code and try it immediately :-) rather than hack at the interpreter and have to wait minutes for tests to run or an hour for a full pypy build. A bientôt, Armin.

Hi, On Thu, Aug 30, 2012 at 10:28 AM, Armin Rigo <arigo@tunes.org> wrote:
This is a blocker, and one that the plan towards CFFI should completely fix.
A quick but successful attempt can be found in cffi/demo/{api,pyobj}.py. api.py is a bit of a hack but lets you declare functions with the decorator @ffi.pyexport("C function signature"). This gives you Python-implemented functions that you can call back from C. They are available from the C code passed into verify(). (A possible extension of this idea would be to choose the precise name and C-level interface that we want to give to the .so; with only a few extra hacks this could let the user write pure Python CFFI code to implement a .so that can be used in the system without knowledge that it is written in Python in the first place, e.g. as a plug-in to some existing program.) pyobj.py is a quick attempt at exposing the C API I described earlier, with "object descriptors". So far the API is only sufficient for the demo in this file, which is a C function that iterates over a list and computes the sum of its items. In the first version it assumes all items are numbers that fit in an "int". In the second version it really calls the Python addition, and so supports various kinds of objects. Note how the whole API is basically defined and implemented as regular Python code. These examples should work well on both CPython and PyPy. A bientôt, Armin.

Maciej Fijalkowski, 04.07.2012 10:48:
Done. https://bitbucket.org/pypy/pypy/wiki/Speeding%20up%20cpyext Stefan