[Cython] CEP1000: Native dispatch through callables

Tue Apr 17 14:24:50 CEST 2012

On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
> Travis Oliphant recently raised the issue on the NumPy list of what
> mechanisms to use to box native functions produced by his Numba so
> that SciPy functions can call it, e.g. (I'm making the numba part
> up):
>
> @numba # Compiles function using LLVM def f(x): return 3 * x
>
> print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!
>
> Obviously, we want something standard, so that Cython functions can
> also be called in a fast way.

OK, here's the benchmark code I've written:

https://github.com/dagss/cep1000

Assumptions etc.:

  - (Very) warm cache case is tested

  - I compile and link libmycallable.so, libmycaller.so and ./bench; with
-fPIC, to emulate the Python environment

  - I use mostly pure C but use PyTypeObject in order to get the offsets
to tp_flags etc right (I emulate the checking that would happen on a
PyObject* according to CEP1000).

  - The test function is "double f(double x) { return x * x; }

  - The benchmark is run in a loop J=1000000 times (and time divided by
J). This is repeated K=10000 times and the minimum walltime of the K run
is used. This gave very stable readings on my system.

Fixing loop iterations:

In the initial results I just scanned the overload list until
NULL-termination. It seemed to me that the code generated for this
scanning was the most important factor.

Therefore I fixed the number of overloads as a known compile-time macro
N *in the caller*. This is somewhat optimistic; however I didn't want to
play with figuring out loop unrolling etc. at the same time, and
hardcoding the length of the overload list sort of took that part out of
the equation.

Table explanation:

  - N: Number of overloads in list. For N=10, there's 9 non-matching
overloads in the list before the matching 10 (but caller doesn't know
this). For N=1, the caller knows this and optimize for a hit in the
first entry.

  - MISMATCHES: If set, the caller tries 4 non-matching signatures before
hitting the final one. If not set, only the correct signature is tried.

  - LIKELY: If set, a GCC likely() macro is used to expect that the 
signature matches.

RESULTS:

Direct call (and execution of!) the function in benchmark loop took 4.8 ns.

An indirect dispatch through a function pointer of known type took 5.4 ns

Notation below is (intern key), in ns

N=1:
   MISMATCHES=False:
     LIKELY=True:     6.44  6.44
     LIKELY=False:    7.52  8.06
   MISMATCHES=True:   8.59  8.59
N=10:
   MISMATCHES=False:  17.19  19.20
   MISMATCHES=True:   36.52  37.59

To be clear, "intern" is an interned "char*" (comparison with a 64 bits 
global variable), while key is comparison of a size_t (comparison of a 
64-bit immediate in the instruction stream).

PRELIMINARY BENCHMARK CONCLUSION:

Intern appears to be as fast or faster than strcmp.

I don't know why (is the pointer offset to the global variable stored in 
less than 64 bits in the x86-64 instruction stream? What gdb (or other) 
commands would I use to figure that out?)

What happens in the assembly is:

	movq	(%rdi,%rax), %rax
	movq	interned_dd(%rip), %rdx
	cmpq	%rdx, (%rax)
	jne	.L3

vs.

	movabsq	$20017697242043, %rdx
	movq	(%rdi,%rax), %rax
	cmpq	%rdx, (%rax)
	jne	.L6

TODO:

The caller tried, for each entry in the overload list, to match all the 
signatures. Changing the order of these loops should also be tried.

Dag