[Cython] CEP1000: Native dispatch through callables
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Wed Apr 18 23:35:42 CEST 2012
On 04/17/2012 02:24 PM, Dag Sverre Seljebotn wrote:
> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>> Travis Oliphant recently raised the issue on the NumPy list of what
>> mechanisms to use to box native functions produced by his Numba so
>> that SciPy functions can call it, e.g. (I'm making the numba part
>> up):
>>
>> @numba # Compiles function using LLVM def f(x): return 3 * x
>>
>> print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!
>>
>> Obviously, we want something standard, so that Cython functions can
>> also be called in a fast way.
>
> OK, here's the benchmark code I've written:
>
> https://github.com/dagss/cep1000
>
> Assumptions etc.:
>
> - (Very) warm cache case is tested
>
> - I compile and link libmycallable.so, libmycaller.so and ./bench; with
> -fPIC, to emulate the Python environment
>
> - I use mostly pure C but use PyTypeObject in order to get the offsets
> to tp_flags etc right (I emulate the checking that would happen on a
> PyObject* according to CEP1000).
>
> - The test function is "double f(double x) { return x * x; }
>
> - The benchmark is run in a loop J=1000000 times (and time divided by
> J). This is repeated K=10000 times and the minimum walltime of the K run
> is used. This gave very stable readings on my system.
>
> Fixing loop iterations:
>
> In the initial results I just scanned the overload list until
> NULL-termination. It seemed to me that the code generated for this
> scanning was the most important factor.
>
> Therefore I fixed the number of overloads as a known compile-time macro
> N *in the caller*. This is somewhat optimistic; however I didn't want to
> play with figuring out loop unrolling etc. at the same time, and
> hardcoding the length of the overload list sort of took that part out of
> the equation.
>
>
> Table explanation:
>
> - N: Number of overloads in list. For N=10, there's 9 non-matching
> overloads in the list before the matching 10 (but caller doesn't know
> this). For N=1, the caller knows this and optimize for a hit in the
> first entry.
>
> - MISMATCHES: If set, the caller tries 4 non-matching signatures before
> hitting the final one. If not set, only the correct signature is tried.
>
> - LIKELY: If set, a GCC likely() macro is used to expect that the
> signature matches.
>
>
> RESULTS:
>
> Direct call (and execution of!) the function in benchmark loop took 4.8 ns.
>
> An indirect dispatch through a function pointer of known type took 5.4 ns
>
> Notation below is (intern key), in ns
>
> N=1:
> MISMATCHES=False:
> LIKELY=True: 6.44 6.44
> LIKELY=False: 7.52 8.06
> MISMATCHES=True: 8.59 8.59
> N=10:
> MISMATCHES=False: 17.19 19.20
> MISMATCHES=True: 36.52 37.59
>
> To be clear, "intern" is an interned "char*" (comparison with a 64 bits
> global variable), while key is comparison of a size_t (comparison of a
> 64-bit immediate in the instruction stream).
First: My benchmarks today are a little inconsistent with earlier
results. I think I have converged now in terms of number of iterations
(higher than last time), but that doesn't explain why indirect dispatch
through function pointer is now *higher*:
Direct took 4.83 ns
Dispatch took 5.91 ns
Anyway, even if crude, hopefully this will tell us something. Order of
benchmark numbers are:
intern key get_func_intern get_func_key
where the get_func_XX versions retrieve a function pointer taking either
a single interned signature or a single key as argument (just see
mycallable.c).
In the MISMATCHES case, the get_func_XX is called 4 times with a miss
and then with the match.
N=1
- MISMATCHES=False:
--- LIKELY=True: 5.91 6.44 8.59 9.13
--- LIKELY=False: 7.52 7.52 9.13 9.13
- MISMATCHES=True: 11.28 11.28 22.56 22.56
N=10
- MISMATCHES=False: 17.18 18.80 29.75 10.74(*)
- MISMATCHES=True: 36.06 38.13 105.00 36.52
Benchmark comments:
The one marked (*) is implemented as a switch statement with known keys
compile-time. I tried shifting around the case label values a bit but
the result persists; it could just be that the compiler does a very good
job of the switch as well.
Overall comments:
Picking a winner doesn't get easier. I'll try to (make myself) get some
perspective. Thinking of extreme cases where performance matters, here's
one (sqrt is apparently 0.5 ns on my machine; sin would be 40 ns):
from numpy import sqrt, sin
cdef double f(double x):
return sqrt(x * x) # or sin(x * x)
Of course, here one could get the pointer in the module at import time.
However, here:
from numpy import sqrt
cdef double f(double x):
return np.sqrt(x * x) # or np.sin(x * x)
the __getattr__ on np sure is larger than any effect we discuss.
From the numbers above, I think I'm ready to accept the "getfuncptr"
approach penalty (1.6 ns for a direct hit, larger when the caller
accepts more signatures) as acceptable, given the added flexibility.
When you care about the 1.6 ns, you're always going to want to do early
binding anyway.
However, just as I'm convinced about interning, there appears to be two
new arguments for keys:
- For a large number of overloads with getfuncptr, it can be much
faster than interning. A 20ns difference starts to get interesting.
- PSF GSoC proposals are not public yet, but I think I can say as much
as that there's a PEP 3121 (multiple interpreter states) proposal, and
that Martin von Lowis is favourable about it. If that goes anywhere it
doesn't make interning impossible but it requires a shared C component
and changing the spec from PyBytesObject to char*. Perhaps that can be
done in a PEP-ification revision though.
Just mentioning it. I'm not sure at this point myself; I think +1 on
getfuncptr, but not sure about keys vs. interning.
Dag
More information about the cython-devel
mailing list