[Cython] CEP1000: Native dispatch through callables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Thu Apr 19 13:05:51 CEST 2012


On 04/19/2012 12:56 PM, Dag Sverre Seljebotn wrote:
> On 04/19/2012 11:07 AM, Nathaniel Smith wrote:
>> On Wed, Apr 18, 2012 at 10:58 PM, Dag Sverre Seljebotn
>> <d.s.seljebotn at astro.uio.no> wrote:
>>> On 04/18/2012 11:35 PM, Dag Sverre Seljebotn wrote:
>>>>
>>>> On 04/17/2012 02:24 PM, Dag Sverre Seljebotn wrote:
>>>>>
>>>>> On 04/13/2012 12:11 AM, Dag Sverre Seljebotn wrote:
>>>>>>
>>>>>> Travis Oliphant recently raised the issue on the NumPy list of what
>>>>>> mechanisms to use to box native functions produced by his Numba, so
>>>>>> that SciPy functions can call them, e.g. (I'm making the numba part
>>>>>> up):
>>>>>>
>>>>>> @numba  # Compiles function using LLVM
>>>>>> def f(x):
>>>>>>     return 3 * x
>>>>>>
>>>>>> print scipy.integrate.quad(f, 1, 2) # do many callbacks natively!
>>>>>>
>>>>>> Obviously, we want something standard, so that Cython functions can
>>>>>> also be called in a fast way.
>>>>>
>>>>>
>>>>> OK, here's the benchmark code I've written:
>>>>>
>>>>> https://github.com/dagss/cep1000
>>>>>
>>>>> Assumptions etc.:
>>>>>
>>>>> - (Very) warm cache case is tested
>>>>>
>>>>> - I compile and link libmycallable.so, libmycaller.so and ./bench
>>>>> with -fPIC, to emulate the Python environment.
>>>>>
>>>>> - I use mostly pure C, but use PyTypeObject in order to get the offsets
>>>>> to tp_flags etc. right (I emulate the checking that would happen on a
>>>>> PyObject* according to CEP1000).
>>>>>
>>>>> - The test function is "double f(double x) { return x * x; }".
>>>>>
>>>>> - The benchmark is run in a loop J=1000000 times (and the time divided
>>>>> by J). This is repeated K=10000 times and the minimum walltime over the
>>>>> K runs is used. This gave very stable readings on my system.
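A minimal sketch of that timing scheme (this is not the actual code from the
cep1000 repository; the names are invented, and clock_gettime stands in for
whatever timer the real harness uses):

    #include <stdio.h>
    #include <time.h>

    /* Placeholder for the dispatch being measured. */
    static double bench_call(double x) { return x * x; }

    int main(void)
    {
        const long J = 1000000;      /* iterations per timed run        */
        const int  K = 10000;        /* repeats; keep the minimum time  */
        double best = 1e300;
        volatile double sink = 0.0;  /* keep the call from being elided */
        for (int k = 0; k < K; k++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long j = 0; j < J; j++)
                sink += bench_call((double)j);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
            if (ns / J < best) best = ns / J;
        }
        printf("%.2f ns per call (sink=%g)\n", best, sink);
        return 0;
    }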
>>>>>
>>>>> Fixing loop iterations:
>>>>>
>>>>> In the initial results I just scanned the overload list until
>>>>> NULL-termination. It seemed to me that the code generated for this
>>>>> scanning was the most important factor.
>>>>>
>>>>> Therefore I fixed the number of overloads as a known compile-time macro
>>>>> N *in the caller*. This is somewhat optimistic; however, I didn't want
>>>>> to play with figuring out loop unrolling etc. at the same time, and
>>>>> hardcoding the length of the overload list sort of took that part out
>>>>> of the equation.
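In other words, the caller scans a fixed number of entries instead of looking
for a NULL terminator; roughly like this (sketch only, names invented):

    #define N 10   /* number of overloads, known to the caller at compile time */

    typedef struct {
        size_t key;
        void  *funcptr;
    } overload_slot;

    static void *scan_overloads(const overload_slot *slots, size_t key)
    {
        /* Fixed trip count lets the compiler unroll and schedule the scan. */
        for (int i = 0; i < N; i++)
            if (slots[i].key == key)
                return slots[i].funcptr;
        return NULL;
    }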
>>>>>
>>>>>
>>>>> Table explanation:
>>>>>
>>>>> - N: Number of overloads in the list. For N=10, there are 9 non-matching
>>>>> overloads in the list before the matching 10th (but the caller doesn't
>>>>> know this). For N=1, the caller knows this and optimizes for a hit in the
>>>>> first entry.
>>>>>
>>>>> - MISMATCHES: If set, the caller tries 4 non-matching signatures before
>>>>> hitting the final one. If not set, only the correct signature is tried.
>>>>>
>>>>> - LIKELY: If set, a GCC likely() macro is used to expect that the
>>>>> signature matches.
>>>>>
>>>>>
>>>>> RESULTS:
>>>>>
>>>>> A direct call to (and execution of!) the function in the benchmark loop
>>>>> took 4.8 ns.
>>>>>
>>>>> An indirect dispatch through a function pointer of known type took
>>>>> 5.4 ns.
>>>>>
>>>>> Notation below is (intern key), in ns
>>>>>
>>>>> N=1:
>>>>>   MISMATCHES=False:
>>>>>     LIKELY=True:  6.44  6.44
>>>>>     LIKELY=False: 7.52  8.06
>>>>>   MISMATCHES=True: 8.59  8.59
>>>>> N=10:
>>>>>   MISMATCHES=False: 17.19 19.20
>>>>>   MISMATCHES=True:  36.52 37.59
>>>>>
>>>>> To be clear, "intern" is an interned "char*" (comparison with a 64-bit
>>>>> global variable), while "key" is comparison of a size_t (comparison with
>>>>> a 64-bit immediate in the instruction stream).
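For concreteness, the two comparison styles are roughly as follows (the names
and the key value are made up for illustration):

    #include <stddef.h>

    /* Interned signature: equality is a pointer comparison against a
       global that has to be loaded at run time. */
    extern char *interned_dd_sig;                 /* e.g. interned "d(d)" */

    /* Key variant: the expected value is a compile-time constant, so the
       comparison is against a 64-bit immediate in the instruction stream. */
    #define DD_KEY ((size_t)0x2964286429ULL)      /* made-up value */

    static int match_intern(char *sig) { return sig == interned_dd_sig; }
    static int match_key(size_t key)   { return key == DD_KEY; }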
>>>>
>>>>
>>>> First: My benchmarks today are a little inconsistent with earlier
>>>> results. I think I have converged now in terms of the number of iterations
>>>> (higher than last time), but that doesn't explain why indirect dispatch
>>>> through a function pointer is now *higher*:
>>>>
>>>> Direct took 4.83 ns
>>>> Dispatch took 5.91 ns
>>>>
>>>> Anyway, even if crude, hopefully this will tell us something. The order
>>>> of the benchmark numbers is:
>>>>
>>>> intern key get_func_intern get_func_key
>>>>
>>>> where the get_func_XX versions retrieve a function pointer, taking either
>>>> a single interned signature or a single key as argument (see
>>>> mycallable.c).
>>>>
>>>> In the MISMATCHES case, the get_func_XX is called 4 times with a miss
>>>> and then with the match.
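A sketch of what those entry points and the MISMATCHES calling pattern look
like; the signatures below are paraphrased from the description above rather
than copied from mycallable.c, and the key values are invented:

    #include <stddef.h>

    typedef double (*dd_funcptr)(double);

    /* Return the native pointer for the given interned signature, or NULL
       if the callable does not provide it. */
    dd_funcptr get_func_intern(void *callable, char *interned_sig);

    /* Same, keyed on a 64-bit signature key instead of an interned string. */
    dd_funcptr get_func_key(void *callable, size_t key);

    /* The MISMATCHES benchmark: four wrong keys, then the matching one. */
    static dd_funcptr lookup_with_misses(void *callable)
    {
        static const size_t tries[] = { 1111, 2222, 3333, 4444,   /* misses */
                                        (size_t)0x2964286429ULL };/* match  */
        for (int i = 0; i < 5; i++) {
            dd_funcptr f = get_func_key(callable, tries[i]);
            if (f) return f;
        }
        return NULL;
    }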
>>>>
>>>> N=1
>>>> - MISMATCHES=False:
>>>> --- LIKELY=True: 5.91 6.44 8.59 9.13
>>>> --- LIKELY=False: 7.52 7.52 9.13 9.13
>>>> - MISMATCHES=True: 11.28 11.28 22.56 22.56
>>>>
>>>> N=10
>>>> - MISMATCHES=False: 17.18 18.80 29.75 10.74(*)
>>>> - MISMATCHES=True: 36.06 38.13 105.00 36.52
>>>>
>>>> Benchmark comments:
>>>>
>>>> The one marked (*) is implemented as a switch statement with keys known
>>>> at compile time. I tried shifting around the case label values a bit but
>>>> the result persists; it could just be that the compiler does a very good
>>>> job of the switch as well.
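That variant is, roughly, a lookup along the following lines, with each key a
compile-time constant (key values and function names here are placeholders):

    #include <stddef.h>

    typedef double (*dd_funcptr)(double);

    extern double impl_a(double);
    extern double impl_b(double);
    extern double impl_c(double);

    static dd_funcptr lookup_by_switch(size_t key)
    {
        switch (key) {
            case 1000: return impl_a;
            case 2000: return impl_b;
            case 3000: return impl_c;
            /* ... one case per known overload ... */
            default:   return NULL;
        }
    }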
>>>
>>>
>>> I should make this clearer: the issue is that the compiler may have
>>> reordered the labels so that the hit came close to first; in the intern
>>> case the code is written so that the hit always comes after 9 mismatches.
>>>
>>> So I redid the (*) test using 10 cases with very different numeric values,
>>> and then tried each of the 10 in turn as the matching case. Timings were
>>> stable for each choice of label (so this is not noise), with values:
>>>
>>> 13.4 11.8 11.8 12.3 10.7 11.2 12.3
>>>
>>> Guess this is the binary decision tree Mark talked about...
>>
>> Yes, if you look at the ASM (this is easier to keep track of if you
>> make the switch cases into round decimal numbers, like 1000, 2000,
>> 3000...), then you can see that gcc is generating a fully unrolled
>> binary search, as basically a set of nested if/else's, like:
>>
>> if (value < 5000) {
>>     if (value == 2000) {
>>         return &ptr_2000;
>>     } else if (value == 4000) {
>>         return &ptr_4000;
>>     }
>> } else {
>>     if (value == 6000) {
>>         return &ptr_6000;
>>     } else if (value == 8000) {
>>         return &ptr_8000;
>>     }
>> }
>>
>> I suppose if we're ambitious we could do the same with the intern
>> table for Cython compile-time variants (we don't know the values ahead
>> of time, but we know how many there will be, so we'd generate the list
>> of intern values, sort it, and then replace the comparison values
>> above with table[middle], etc.).
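That is, something along these lines: the same unrolled tree, but comparing
against table slots that get filled with interned pointers at module init.
The table layout and names here are invented for illustration:

    #include <stddef.h>

    typedef double (*dd_funcptr)(double);

    typedef struct {
        char      *interned_sig;   /* interned signature string */
        dd_funcptr func;
    } sig_entry;

    /* Filled and sorted by interned pointer value at module init;
       the size (4 here) is known when the C code is generated. */
    static sig_entry table[4];

    static dd_funcptr lookup_unrolled(char *sig)
    {
        /* Fully unrolled binary search: the tree shape is fixed at C
           compile time, the comparison values are loaded from the table. */
        if (sig < table[2].interned_sig) {
            if (sig == table[0].interned_sig) return table[0].func;
            if (sig == table[1].interned_sig) return table[1].func;
        } else {
            if (sig == table[2].interned_sig) return table[2].func;
            if (sig == table[3].interned_sig) return table[3].func;
        }
        return NULL;
    }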
>
> Right. With everything being essentially equal, this isn't getting easier.
>
> I thought of some drawbacks of getfuncptr:
>
> - Important: It doesn't allow you to actually inspect the supported
> signatures, which is needed (or at least convenient) if you want to use
> an FFI library or do some JITing. So an iteration mechanism is still
> needed in addition, meaning the number of things for the object to
> implement grows a bit large. Default implementations help -- OTOH there
> really wasn't a major drawback with the table approach (see the sketch
> after this list) as long as JITs can just replace it?
>
> - Minor: I've read that on Intel's Sandy Bridge, the micro-ops are
> cached after instruction decoding, and that the micro-op cache is so
> precious (and decoding so expensive) that the recommendation is no loop
> unrolling at all! So essentially sticking the table in unrolled
> instructions may not continue to be a good idea. (Of course, getfuncptr
> doesn't ).

... (Of course, getfuncptr doesn't force you to do that; you could keep
traversing data. But without getfuncptr you are forced to take a
table-driven approach, which may be better for the micro-op cache. Pure
speculation, though.)
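For reference, the table approach discussed above would expose something
iterable along these lines (the struct layout and names are invented here,
not taken from CEP1000), which an FFI or JIT user can walk to see which
signatures exist, and which a JIT could also replace wholesale:

    #include <stddef.h>

    typedef struct {
        char *interned_sig;   /* interned signature string              */
        void *funcptr;        /* native entry point for that signature  */
    } native_overload;

    typedef struct {
        size_t           count;
        native_overload *entries;  /* a JIT can swap this out at run time */
    } native_overload_table;

    /* Callers (or introspection code) simply walk the table. */
    static void *find_overload(native_overload_table *t, char *sig)
    {
        for (size_t i = 0; i < t->count; i++)
            if (t->entries[i].interned_sig == sig)
                return t->entries[i].funcptr;
        return NULL;
    }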

Dag

