[Cython] CEP1000: Native dispatch through callables

Robert Bradshaw robertwb at gmail.com
Fri Apr 13 19:26:34 CEST 2012


On Fri, Apr 13, 2012 at 5:48 AM, mark florisson
<markflorisson88 at gmail.com> wrote:
> On 13 April 2012 12:59, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>> On 04/13/2012 01:38 PM, Stefan Behnel wrote:
>>>
>>> Robert Bradshaw, 13.04.2012 12:17:
>>>>
>>>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
>>>>>
>>>>> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>>>>>>
>>>>>> Have you given any thought as to what happens if __call__ is
>>>>>> re-assigned for an object (or subclass of an object) supporting this
>>>>>> interface? Or is this out of scope?
>>>>>
>>>>>
>>>>> Out-of-scope, I'd say. Though you can always write an object that
>>>>> detects if
>>>>> you assign to __call__...
>>>
>>>
>>> +1 for out of scope. This is a pure C level feature.
>>>
>>>
>>>>>> Minor nit: I don't think should_dereference is worth branching on, if
>>>>>> one wants to save the allocation one can still use a variable-sized
>>>>>> type and point to oneself. Yes, that's an extra dereference, but the
>>>>>> memory is already likely close and it greatly simplifies the logic.
>>>>>> But I could be wrong here.
>>>>>
>>>>>
>>>>>
>>>>> Those minor nits are exactly what I seek; since Travis will have the
>>>>> first implementation in numba<->SciPy, I just want to make sure that
>>>>> what he does will work efficiently with Cython.
>>>>
>>>>
>>>> +1
>>>>
>>>> I have to admit building/invoking these var-arg-sized __nativecall__
>>>> records seems painful. Here's another suggestion:
>>>>
>>>> struct {
>>>>     void* pointer;
>>>>     size_t signature; // compressed binary representation, 95% coverage
>>
>>
>> Once you start passing around functions that take memory view slices as
>> arguments, that 95% estimate will be off I think.
>>
>
> It kind of depends on which argument types and how many arguments you
> will allow, and whether or not collisions would be fine (which would
> imply ID comparison + strcmp()).

Interesting idea, though this has the drawback of doubling (at least)
the overhead of the simple (and important) case, as well as raising
memory-requirement/locality issues.

>>>>     char* long_signature; // used if signature is not representable in
>>>> a size_t, as indicated by signature = 0
>>>> } record;
>>>>
>>>> These char* could optionally be allocated at the end of the record*
>>>> for optimal locality. We could even dispense with the binary
>>>> signature, but having that option allows us to avoid strcmp for stuff
>>>> like d)d and ffi)f.
>>>
>>>
>>> Assuming we use literals and a const char* for the signature, the C
>>> compiler would cut down the number of signature strings automatically for
>>> us. And a pointer comparison is the same as a size_t comparison.
>>
>>
>> I'll go one further: Intern Python bytes objects. It's just a PyObject*, but
>> it's *required* (or just strongly encouraged) to have gone through
>>
>> sig = sys.modules['_nativecall'].interned_db.setdefault(sig, sig)
>>
>> Obviously in a PEP you'd have a C-API function for such interning
>> (completely standalone utility). Performance of interning operation itself
>> doesn't matter...
>>
>> Unless CPython has interning features itself, like in Java? Was that present
>> back in the day and then ripped out?
>>
>> Requiring interning is somewhat less elegant in one way, but it makes a lot
>> of other stuff much simpler.
>>
>> That gives us
>>
>> struct {
>>    void *pointer;
>>    PyBytesObject *signature;
>> } record;
>>
>> and then you allocate a NULL-terminated arrays of these for all the
>> overloads.
>>
>
> Interesting. What I like about size_t it that it could define a
> deterministic ordering, which means specializations could be stored in
> a binary search tree in array form.

I think the number of specializations would have to be quite large
(>10, maybe 100) before a binary search wins out over a simple scan,
but if we stored a count rather than using a NULL-terminated array, the
lookup function could take this into account. (The header will already
have plenty of room if we're storing a version number and want the
records to be properly aligned.)

Requiring them to be sorted would also allow us to abort on average
half way through a scan. Of course prioritizing the "likely"
signatures first may be more of a win.

> Cython would precompute the size_t
> for the specialization it needs (and maybe account for promotions as
> well).

Exactly.

>>> That would only apply at a per-module level, though, so it would require
>>> an
>>> indirection for the signature IDs. But it would avoid a global registry.
>>>
>>> Another idea would be to set the signature ID field to 0 at the beginning
>>> and call a C-API function to let the current runtime assign an ID > 0,
>>> unique for the currently running application. Then every user would only
>>> have to parse the signature once to adapt to the respective ID and could
>>> otherwise branch based on it directly.
>>>
>>> For Cython, we could generate a static ID variable for each typed call
>>> that
>>> we found in the sources. When encountering a C signature on a callable,
>>> either a) the ID variable is still empty (initial case), then we parse the
>>> signature to see if it matches the expected signature. If it does, we
>>> assign the corresponding ID to the static ID variable and issue a direct
>>> call. If b) the ID field is already set (normal case), we compare the
>>> signature IDs directly and issue a C call if they match. If the IDs do not
>>> match, we issue a normal Python call.
>>>
>>>
>>>>> Right... if we do some work to synchronize the types for Cython modules
>>>>> generated by the same version of Cython, we're left with 3-4 types for
>>>>> Cython, right? Then a couple for numba and one for f2py; so on the order
>>>>> of
>>>>> 10?
>>>>
>>>>
>>>> No, I think each closure is its own type.
>>>
>>>
>>> And that even applies to fused functions, right? They'd have one closure
>>> for each type combination.
>>>
>>>
>>>>> An alternative is do something funny in the type object to get across
>>>>> the
>>>>> offset-in-object information (abusing the docstring, or introduce our
>>>>> own
>>>>> flag which means that the type object has an additional non-standard
>>>>> field
>>>>> at the end).
>>>>
>>>>
>>>> It's a hack, but the flag + non-standard field idea might just work...
>>>
>>>
>>> Plus, it wouldn't have to stay a non-standard field. If it's accepted into
>>> CPython 3.4, we could safely use it in all existing versions of CPython.
>>
>>
>> Sounds good. Perhaps just use a single "extended" flag, then add a new
>> flag field in our payload, in case we need to extend the type object yet
>> again later and run out of unused flag bits (TBD: figure out how many
>> unused flag bits there are).
>>
>> Dag
>>
>> _______________________________________________
>> cython-devel mailing list
>> cython-devel at python.org
>> http://mail.python.org/mailman/listinfo/cython-devel
>
> Maybe it would be a good idea if there was a third project that
> defined this functionality in header files which projects could
> include (or in case of Cython directly inject into the generated C
> files). E.g. a function to check for the native interface, and a
> function that given an array of signature strings and function
> pointers builds the ABI information (and computes the ID), and one
> that given an ID and signature string finds the right specialization.
> The project should also expose a simple type system for the types we
> care about, and be able to generate signature strings and IDs for
> signatures.
>
> An optimization for the common case would be to only look at the first
> entry in the ABI information directly and compare that for the
> non-overloaded case, and otherwise do a logarithmic lookup, with a
> final fallback to calling through the Python layer.

I think the ABI should be simple (and fully specified) enough to allow
a trivial implementation, and we and others could ship our
implementations as a tiny C library (or just a header file).

- Robert

