[Cython] CEP1000: Native dispatch through callables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sat May 5 18:27:11 CEST 2012


On 05/05/2012 01:08 PM, mark florisson wrote:
> On 3 May 2012 13:24, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>> I'm afraid I'm going to try to kick this thread alive again. I want us to
>> have something that Travis can implement in numba and "his" portion of
>> SciPy, and also that could be used by NumPy devs.
>>
>> Since the decisions are rather arbitrary, perhaps we can try to quickly get
>> to the "+1" stage (or, depending on how things turn out, a tournament
>> starting with at most one proposal per person).
>>
>>
>> On 04/20/2012 09:30 AM, Robert Bradshaw wrote:
>>>
>>> On Thu, Apr 19, 2012 at 6:18 AM, Dag Sverre Seljebotn
>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>
>>>> On 04/19/2012 01:20 PM, Nathaniel Smith wrote:
>>>>>
>>>>>
>>>>> On Thu, Apr 19, 2012 at 11:56 AM, Dag Sverre Seljebotn
>>>>> <d.s.seljebotn at astro.uio.no> wrote:
>>>>>>
>>>>>>
>>>>>> I thought of some drawbacks of getfuncptr:
>>>>>>
>>>>>>   - Important: Doesn't allow you to actually inspect the supported
>>>>>> signatures, which is needed (or at least convenient) if you want to use
>>>>>> an FFI library or do some JIT-ing. So an iteration mechanism is still
>>>>>> needed in addition, meaning the number of things for the object to
>>>>>> implement grows a bit large. Default implementations help -- OTOH there
>>>>>> really wasn't a major drawback with the table approach as long as JITs
>>>>>> can just replace it?
>>>>>
>>>>>
>>>>>
>>>>> But this is orthogonal to the table vs. getfuncptr discussion. We're
>>>>> assuming that the table might be extended at runtime, which means you
>>>>> can't use it to determine which signatures are supported. So we need
>>>>> some sort of extra interface for the caller and callee to negotiate a
>>>>> type anyway. (I'm intentionally agnostic about whether it makes more
>>>>> sense for the caller or the callee to be doing the iterating... in
>>>>> general type negotiation could be quite complicated, and I don't think
>>>>> we know enough to get that interface right yet.)
>>>>
>>>>
>>>>
>>>> Hmm. Right. Let's define an explicit goal for the CEP then.
>>>>
>>>> What I care about is getting the spec right enough such that, e.g.,
>>>> NumPy and SciPy, and other (mostly manually written) C extensions with
>>>> a slow development pace, can be forward-compatible with whatever crazy
>>>> things Cython or Numba does.
>>>>
>>>> There are 4 cases:
>>>>
>>>>   1) JIT calls JIT (ruled out straight away)
>>>>
>>>>   2) JIT calls static: Say that Numba wants to optimize calls to np.sin
>>>> etc. without special-casing; this seems to require reading a table of
>>>> static signatures
>>>>
>>>>   3) Static calls JIT: This is the case when scipy.integrate routines
>>>> call a Numba callback and Numba generates a specialization for the dtype
>>>> they explicitly need. This calls for getfuncptr (but perhaps in a form
>>>> which we can't quite determine yet?).
>>>>
>>>>   4) Static calls static: Either table or getfuncptr works.
>>>>
>>>> My gut feeling is to go for 2) and 4) in this round => table.
>>>
>>>
>>> getfuncptr is really simple and flexible, but I'm with you on both of
>>> these points, and the overhead was not trivial.
>>
>>
>> It's interesting to hear you say the overhead was not trivial (that was my
>> hunch too but I sort of yielded to peer pressure). I think Sage has some
>> history with this -- isn't one of the reasons for the "cpdef" vs. "cdef"
>> split that "cpdef" has the cost of a single lookup for the presence of a
>> __dict__ on the object, which was an unacceptable penalty for parts of Sage?
>> That can't have been much more than a 1ns penalty per instance.
>>
>>
>>> Of course we could offer both, i.e. look at the table first; if it's
>>> not there, call getfuncptr if it's non-null; then fall back to a "slow"
>>> call or error. These are all opt-in depending on how hard you want to
>>> try to optimize things.
>>
>>
>> That's actually exactly what I was envisioning -- in time (with JITs on both
>> ends) the table could act sort of as a cache for commonly used overloads,
>> and getfuncptr would access the others more slowly.
>>
>>
>>> As far as keys vs. interning, I'm also tempted to try to have my cake
>>> and eat it too. Define a space-friendly encoding for signatures and
>>> require interning for anything that doesn't fit into a single
>>> sizeof(void*). The fact that this cutoff would vary for 32 vs 64-bit
>>> would require some care, but could be done with macros in C. If the
>>> signatures produce non-aligned "pointer" values there won't be any
>>> collisions, and this way libraries only have to share in the global
>>> (Python-level?) interning scheme iff they want to expose/use "large"
>>> signatures.
>>
>>
>> That was the approach I described to Nathaniel as having the "worst features
>> of both" -- lack of readable gdb dumps of the keys, and having to define an
>> interning mechanism for use by the 5% of cases that don't fit.
>>
>> To sum up what's been said earlier: Apart from very many arguments, the
>> only thing that would blow the key size above 64 bits would be things like
>> classes/interfaces/vtables. But in that case, reasonable-sized keys for the
>> vtables can be computed (whether by interning, cryptographic hashing, or a
>> GUID like Microsoft COM).
>>
>> So I'm still +1 on my proposal; but I would be happy with an intern-based
>> proposal if somebody bothers to flesh it out a bit (I don't quite know how
>> I'd do it and would get lost in PyObject* vs. char* and cross-language state
>> sharing...).
>>
>> My proposal in summary:
>>
>>   - Table with variable-sized entries (not getfuncptr, not interning) that
>> can be scanned by the caller in 128-bit increments.
>
> Hm, so the caller knows what kind of key it needs to compare against, so
> if it has a 64-bit key then it won't need to compare 128 bits (padded
> with zeroes?). But if it doesn't compare 128 bits, then it means 128-bit
> keys cannot have 64-bit keys as a prefix. Will that be a problem, or

Did you read the CEP? I also clarified this in a post in response to
Nathaniel. The idea is that the scanner doesn't need to branch on the
key length anywhere. This requires a) making each key n*64 bits long
where n is odd, so that function pointers always sit at (m*128 + 64)
bits from the start for some non-negative integer m, and b) inserting
some protective prefix every 128 bits in the key.
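
To make that concrete, here is a rough C sketch of how a caller could
scan for a signature that fits in a single 64-bit key. The names, the
end-of-table sentinel and the exact continuation marker are just
illustrative placeholders, not something the CEP has fixed yet:

#include <stdint.h>

/* Each slot is 64 bits and every entry is a whole number of 128-bit pairs:
 *   [key][funcptr]                for a 1*64-bit key,
 *   [key0][key1][key2][funcptr]   for a 3*64-bit key,
 * so a function pointer always sits at m*128 + 64 bits from the start. */

#define CEP_TABLE_END   UINT64_C(0)  /* assumed end-of-table sentinel */
#define CEP_CONT_MARKER UINT64_C(1)  /* assumed "protective prefix" bit, set in
                                        key words at interior 128-bit offsets
                                        of a long key (key2 above) and never
                                        in a valid stand-alone key */

typedef struct { uint64_t key_word; uint64_t ptr_word; } cep_pair_t;

/* Scan in 128-bit steps for a signature whose key fits in one 64-bit word.
 * The loop only ever inspects key_word, so it never branches on how long the
 * keys of other entries are; CEP_CONT_MARKER keeps it from accidentally
 * matching the interior of a longer entry. */
static uint64_t cep_find_short(const cep_pair_t *table, uint64_t key)
{
    for (; table->key_word != CEP_TABLE_END; ++table)
        if (table->key_word == key)
            return table->ptr_word;   /* 64-bit slot holding the funcptr */
    return 0;                         /* signature not present */
}

The same loop works unchanged whether the table holds only short keys or
a mix of short and long ones, which is the whole point of padding every
entry to a multiple of 128 bits.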


> would it make sense to make the first entry a pointer pointing to
> 128-bit keys, and the rest are all 64-bit keys (or even 32-bit keys and
> two pointers)? e.g. a contiguous list of [128-bit key/pointer
> list-pointer, 64-bit keys & func pointers, 128-bit keys & func
> pointers, NULL]

I don't really understand this description, but in general I'm sceptical 
about the pipelining abilities of pointer-chasing code. It may be OK, 
but it would require a benchmark, and if there's not a reason to have it...

> Even with a naive encoding scheme you could encode 3 scalar arguments
> and a return value in 32 bits (e.g. 'dddd'). That might be better on
> x86?

Robert and I have been assuming some non-ASCII encoding that would
allow many more arguments in 64 bits.
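
For instance (purely to illustrate the density argument, not a proposed
encoding), 4-bit type codes would already fit a return type plus 15
arguments into a single 64-bit key:

#include <stdint.h>

/* Hypothetical 4-bit type codes, packed lowest-nibble-first with the return
 * type in the first slot.  The real encoding is still open. */
enum cep_typecode {
    CEP_T_NONE   = 0x0,   /* unused slots stay zero */
    CEP_T_DOUBLE = 0x1,
    CEP_T_INT    = 0x2,
    CEP_T_OBJECT = 0x3
    /* ... */
};

static uint64_t cep_encode_key(const unsigned char *codes, int n)
{
    uint64_t key = 0;
    for (int i = 0; i < n && i < 16; i++)
        key |= (uint64_t)(codes[i] & 0xF) << (4 * i);
    return key;
}

/* double f(double, double): codes {CEP_T_DOUBLE, CEP_T_DOUBLE, CEP_T_DOUBLE}
 * encode to 0x111 -- three nibbles where ASCII 'ddd' would need three bytes. */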

>
>>   - Only use 64-bit pointers, in order to keep the table format the same on
>> 32-bit and 64-bit platforms.
>
> Pointer to the function? I think that would only be harder to use than
> native pointers?

That was to make the multiple-of-128-bit-entry idea work without having
to require that keys differ between 32-bit and 64-bit platforms.
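
Concretely (again just a sketch of the idea, not an API the CEP defines),
the native pointer simply lives in a fixed 64-bit slot:

#include <stdint.h>

/* Hypothetical helpers: the slot is always 64 bits wide, so the table layout
 * -- and with it the keys and offsets -- is identical on 32-bit and 64-bit
 * builds; on a 32-bit platform the pointer just occupies the low half. */
typedef void (*cep_generic_func)(void);

static void cep_store_funcptr(uint64_t *slot, cep_generic_func fn)
{
    *slot = (uint64_t)(uintptr_t)fn;            /* zero-extended on 32-bit */
}

static cep_generic_func cep_load_funcptr(const uint64_t *slot)
{
    return (cep_generic_func)(uintptr_t)*slot;  /* narrows back on 32-bit */
}

The caller then casts the loaded pointer to the concrete signature it
negotiated through the key.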

Dag

>>   - Encode the signature strings. Utility functions to work with this (both
>> to scan tables and to encode/decode a format string) will be provided as C
>> code by the CEP that can be bundled.
>>
>> Pros:
>>
>>   - Table format is not specific to the Python world (it makes as much sense to
>> use, e.g., internally in Julia)
>>
>>   - No state needs to be shared between packages at run-time (they can use the
>> bundled C code in isolation if they wish)
>>
>>   - No need for an interning machinery
>>
>>   - More easily compatible with multiple interpreter states (?)
>>
>>   - Minor performance benefit of table over getfuncptr (intern vs. key didn't
>> matter). [Cue comment that this doesn't matter.]
>>
>> Cons:
>>
>>   - Lack of the instant low-level debuggability of the interned case (a
>> human needs to run a function on the key constant to see what it corresponds
>> to)
>>
>>   - Not as extendable as getfuncptr (though currently we don't quite know how
>> we would extend it, and it's easy to add getfuncptr in the future)
>>
>> Notes:
>>
>>   - When extended to handle vtable argument types, there still needs to be
>> some interning or crypto-hashing. But that is likely to come up anyway as
>> part of a COM-like queryInterface protocol, and at that point we will be
>> better placed to make those decisions and design a good interning mechanism.
>>
>> Dag
>>


