[Cython] CEP1000: Native dispatch through callables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Tue Apr 17 14:40:39 CEST 2012


On 04/17/2012 02:36 PM, Dag Sverre Seljebotn wrote:
> On 04/17/2012 02:24 PM, Dag Sverre Seljebotn wrote:
>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>> Travis Oliphant recently raised the issue on the NumPy list of what
>>> mechanisms to use to box native functions produced by his Numba so
>>> that SciPy functions can call them, e.g. (I'm making the numba part
>>> up):
>>>
>>> @numba  # compiles the function using LLVM
>>> def f(x):
>>>     return 3 * x
>>>
>>> print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!
>>>
>>> Obviously, we want something standard, so that Cython functions can
>>> also be called in a fast way.
>>
>> OK, here's the benchmark code I've written:
>>
>> https://github.com/dagss/cep1000
>>
>> Assumptions etc.:
>>
>> - (Very) warm cache case is tested
>>
>> - I compile and link libmycallable.so, libmycaller.so and ./bench; with
>> -fPIC, to emulate the Python environment
>>
>> - I use mostly pure C but use PyTypeObject in order to get the offsets
>> to tp_flags etc right (I emulate the checking that would happen on a
>> PyObject* according to CEP1000).
>>
>> - The test function is "double f(double x) { return x * x; }"
>>
>> - The benchmark is run in a loop J=1000000 times (and the time divided by
>> J). This is repeated K=10000 times and the minimum walltime over the K runs
>> is used. This gave very stable readings on my system.
>>
>> Fixing loop iterations:
>>
>> In the initial results I just scanned the overload list until
>> NULL-termination. It seemed to me that the code generated for this
>> scanning was the most important factor.
>>
>> Therefore I fixed the number of overloads as a known compile-time macro
>> N *in the caller*. This is somewhat optimistic; however I didn't want to
>> play with figuring out loop unrolling etc. at the same time, and
>> hardcoding the length of the overload list sort of took that part out of
>> the equation.
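A rough sketch of what the fixed-N scan looks like from the caller's side. The struct layout and names here are hypothetical (assuming an interned-pointer signature, as discussed for CEP1000), not the actual repo code:

```c
#include <stddef.h>

/* One entry in a callable's overload table. */
typedef struct {
    void *signature;        /* interned signature pointer */
    void (*funcptr)(void);  /* native entry point; cast to the right type on match */
} overload_t;

#define N 10  /* number of overloads, fixed at compile time in the caller */

/* Dummy native implementation used for illustration. */
static void dummy_impl(void) {}

/* Scan exactly N entries for a signature; because N is a compile-time
 * constant, the compiler is free to unroll this loop. */
static void (*find_overload(const overload_t *table, const void *sig))(void)
{
    for (int i = 0; i < N; i++)
        if (table[i].signature == sig)  /* interned: pointer equality suffices */
            return table[i].funcptr;
    return NULL;
}
```

Hardcoding N in the caller is what takes loop-unrolling variation out of the measurement, at the cost of being optimistic relative to a NULL-terminated scan.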
>>
>>
>> Table explanation:
>>
>> - N: Number of overloads in the list. For N=10, there are 9 non-matching
>> overloads in the list before the matching 10th (but the caller doesn't know
>> this). For N=1, the caller knows this and optimizes for a hit on the
>> first entry.
>>
>> - MISMATCHES: If set, the caller tries 4 non-matching signatures before
>> hitting the final one. If not set, only the correct signature is tried.
>>
>> - LIKELY: If set, a GCC likely() macro is used to expect that the
>> signature matches.
>>
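The likely() macro mentioned above is the usual GCC/Clang idiom built on __builtin_expect (shown here as a sketch; the benchmark presumably defines something equivalent). `check_sig` is a made-up name for illustration:

```c
/* Hint to GCC/Clang that the condition is expected to be true, so the
 * fall-through path becomes the match and the mismatch path is moved
 * out of the hot code. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Example use in a dispatch check against an interned signature pointer. */
static int check_sig(const void *entry_sig, const void *wanted_sig)
{
    if (likely(entry_sig == wanted_sig))
        return 1;  /* hot path: signature matches */
    return 0;
}
```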
>>
>> RESULTS:
>>
>> A direct call to (and execution of!) the function in the benchmark loop
>> took 4.8 ns.
>>
>> An indirect dispatch through a function pointer of known type took 5.4 ns.
>>
>> Notation below is (intern, key), times in ns:
>>
>> N=1:
>>   MISMATCHES=False:
>>     LIKELY=True:  6.44  6.44
>>     LIKELY=False: 7.52  8.06
>>   MISMATCHES=True:  8.59  8.59
>> N=10:
>>   MISMATCHES=False: 17.19 19.20
>>   MISMATCHES=True:  36.52 37.59
>>
>> To be clear, "intern" is an interned "char*" (comparison with a 64-bit
>> global variable), while "key" is comparison of a size_t (comparison with a
>> 64-bit immediate in the instruction stream).
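In C, the two comparison strategies look roughly like this (a sketch; `interned_dd` is a stand-in for the interned signature pointer, and the key constant is the one visible in the movabsq further down):

```c
#include <stdint.h>

/* Strategy 1: interned signature -- compare two pointers; the expected
 * value lives in a global variable, so the compiler emits a RIP-relative
 * load of interned_dd before the compare. */
static const char *interned_dd;  /* set once at startup by the interning table */

static int match_intern(const char *entry_sig)
{
    return entry_sig == interned_dd;
}

/* Strategy 2: direct key -- compare against a 64-bit constant that is
 * baked into the instruction stream as an immediate (the movabsq). */
#define KEY_DD ((uint64_t)20017697242043ULL)  /* encodes the "dd" signature */

static int match_key(uint64_t entry_key)
{
    return entry_key == KEY_DD;
}
```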
>>
>> PRELIMINARY BENCHMARK CONCLUSION:
>>
>> Intern appears to be as fast or faster than strcmp.
>>
>> I don't know why (is the pointer offset to the global variable stored in
>> less than 64 bits in the x86-64 instruction stream? What gdb (or other)
>> commands would I use to figure that out?)
>>
>> What happens in the assembly is:
>>
>> movq (%rdi,%rax), %rax
>> movq interned_dd(%rip), %rdx
>> cmpq %rdx, (%rax)
>> jne .L3
>>
>> vs.
>>
>> movabsq $20017697242043, %rdx
>> movq (%rdi,%rax), %rax
>> cmpq %rdx, (%rax)
>> jne .L6

OK, silly to quote the assembly for the case where the two perform the same 
(LIKELY=True). With LIKELY=False, the jne changes to je in both places, 
and strcmp performance drops relative to intern.

Dag

>>
>>
>> TODO:
>>
>> The caller tried, for each entry in the overload list, to match all the
>> signatures; the opposite loop order should also be tried.
>
> One more data-point:
>
> When comparing a 96-bit key directly, the fastest benchmark for keys
> (N=1, MISMATCHES=False, LIKELY=True) grows from 6.44 to 6.98 ns. (It
> should perform relatively better when N>1, unless prefixes match.)
>
> 448-bit key is 8.59 ns.
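A 96-bit key comparison presumably goes word by word, e.g. one 64-bit word plus one 32-bit word; the layout and names below are an assumed sketch, not the benchmark's actual encoding:

```c
#include <stdint.h>

typedef struct {
    uint64_t lo;  /* first 64 bits of the signature key */
    uint32_t hi;  /* remaining 32 bits */
} key96_t;

/* Two compares instead of one; when most keys already differ in the first
 * word, the second compare is rarely reached, which would explain why the
 * cost over a plain 64-bit key is small unless prefixes match. */
static int match_key96(key96_t a, key96_t b)
{
    return a.lo == b.lo && a.hi == b.hi;
}
```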
>
> Dag
> _______________________________________________
> cython-devel mailing list
> cython-devel at python.org
> http://mail.python.org/mailman/listinfo/cython-devel


