[pypy-dev] Cython-CEP: Native dispatch through Python callables

Stefan Behnel stefan_ml at behnel.de
Sat Apr 14 09:36:00 CEST 2012


wlavrijsen at lbl.gov, 13.04.2012 22:19:
>> It's not necessarily slow because a) the intermediate function can do more
>> than just passing through data (especially in the case of Cython or Numba)
>> and b) the exception case is usually just that, an exceptional case.
> 
> interesting: under a), what other useful work can be done by the intermediate
> function?

Cython is a programming language, so you can put anything you like into
the wrapper. Note that a lot of code is not (re-)written specifically for
one platform (CPython/PyPy/...), or at least shouldn't be. So when you
write a wrapper as a library, you may want to put some (and sometimes a
lot of) functionality into the wrapper itself, be it to make a C-ish
interface more comfortable or to provide additional functionality on top
of a bare C/C++ library. Also, Cython lets you parallelise code quite
easily based on OpenMP, which is another thing that is often done in
wrappers for computational code.

This discussion actually arose from the intention to interface Cython code
efficiently with Numba, which uses LLVM to generate code at runtime.
For that, both sides need to be able to see the C level signatures of what
they call in order to bypass the Python level call overhead.
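
To make the idea a bit more concrete, here is a rough Python sketch based on
ctypes (which is built on libffi). It is only an illustration, not the
protocol proposed in the CEP: the class name NativeCallable, the attributes
c_signature and c_address, and the "d->d" signature string format are all
made up for this example. The "native" function is also just a ctypes
callback around Python code, used as a stand-in for a compiled function so
that the sketch runs anywhere.

```python
import ctypes

# Hypothetical one-letter type codes, invented for this sketch.
_CTYPES = {"d": ctypes.c_double, "i": ctypes.c_int, "l": ctypes.c_long}


def prototype_for(signature):
    """Turn a string such as "dd->d" into a ctypes function prototype."""
    args, ret = signature.split("->")
    return ctypes.CFUNCTYPE(_CTYPES[ret], *[_CTYPES[c] for c in args])


class NativeCallable:
    """A Python callable that also advertises a C-level entry point."""

    def __init__(self, func, signature):
        self._func = func
        self.c_signature = signature
        # Keep the callback object alive; its address acts as the C function pointer.
        self._callback = prototype_for(signature)(func)
        self.c_address = ctypes.cast(self._callback, ctypes.c_void_p).value

    def __call__(self, *args):
        # Ordinary Python-level call path.
        return self._func(*args)


def call_native(obj, expected_signature, *args):
    """Call through the function pointer if the signature matches, else fall back."""
    if getattr(obj, "c_signature", None) == expected_signature:
        native = prototype_for(expected_signature)(obj.c_address)
        return native(*args)
    return obj(*args)  # no matching C signature: normal Python call


square = NativeCallable(lambda x: x * x, "d->d")
result_native = call_native(square, "d->d", 3.0)    # goes through the pointer
result_fallback = call_native(abs, "d->d", -3.0)    # plain Python call
```

The point is only that the caller inspects a signature published by the
callee and, on a match, calls the pointer directly instead of going through
the generic Python call protocol.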


> (Yes for b), but the slowness is in having an extra layered C++ call in
> between, the one that contains the try/catch. That's at least an extra 25%
> overhead over the naked function pointer at current levels. Of course, only
> in a micro benchmark. In real life, it's irrelevant.)

IIRC, exceptions can be surprisingly expensive in C++, so I agree that it
matters for very small functions. But you'd want to inline those anyway and
avoid exceptions if at all possible.


>> Ok, I just took a look at it and it seems like the right thing to use for
>> this. Then all that's left is an efficient runtime mapping from the
>> exported signature to a libffi call specification.
> 
> It need not even be an efficient mapping: since the mapping is static for
> each function pointer, the JIT takes care of removing it (that is, it puts
> the results of the mapping inline, so the lookup code itself disappears).

We're currently discussing ways to do this in Cython as well. The code
wouldn't get removed but at least moved out of the way, so that the CPU's
branch prediction can do the right thing. That gives you about the same
performance in practice.
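
As a loose illustration of why the mapping cost need not matter, the
following Python sketch resolves the signature once per function pointer and
keeps the lookup out of the hot loop. It models neither PyPy's JIT nor
Cython's code generation. The helper names resolve and make_caller and the
"d->d" signature format are assumptions made for this example, and the
function pointer is again a ctypes callback standing in for compiled code.

```python
import ctypes

lookups = 0  # counts how often the (slow) mapping code runs


def resolve(signature):
    """Slow path: map a hypothetical signature string to a ctypes prototype."""
    global lookups
    lookups += 1
    codes = {"d": ctypes.c_double, "i": ctypes.c_int}
    args, ret = signature.split("->")
    return ctypes.CFUNCTYPE(codes[ret], *[codes[c] for c in args])


def make_caller(address, signature):
    """Do the mapping once and return a callable that skips it afterwards."""
    return resolve(signature)(address)


# Stand-in for a compiled function: a ctypes callback around Python code.
callback = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)(lambda x: x + 1.0)
address = ctypes.cast(callback, ctypes.c_void_p).value

add_one = make_caller(address, "d->d")
total = 0.0
for _ in range(1000):
    total = add_one(total)  # hot loop: no signature lookup in here
```

After the loop, the lookup has run once while the call itself has run a
thousand times, which is why the speed of the mapping is mostly irrelevant
as long as it stays off the fast path.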


> Same goes for C++ overloads (with a little care): each overload that fails
> should result in a (python) exception during mapping of the arguments. The
> JIT then removes those branches from the trace, leaving only the call that
> succeeded in the optimized trace. Thus, any time spent making the selection
> of the overload efficient is mostly wasted, as that code gets completely
> removed.

A static compiler would handle that similarly.

Stefan


