[Cython] CEP1000: Native dispatch through callables

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Fri Apr 13 22:27:29 CEST 2012

Ah, I didn't think about 6-bit or Huffman. Certainly helps.

I'm almost +1 on your proposal now, but a couple more ideas:

1) Let the key (the size_t) spill over into the next specialization entry if it is too large, and prepend that key with a continuation code (two size_ts could together encode "iii)-d\0\0" on 32-bit systems with an 8-bit encoding, using - as the continuation marker). A key-based caller will expect a continuation if it knows about the specialization, and the prepended char prevents spurious matches against the overspilled slot.

We could even use the pointers for part of the continuation...

2) Separate the char* format strings from the keys, i.e. this memory layout:


Where nslots is larger than nspecs if there are continuations.
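(The layout itself was lost from the archived message. A minimal C sketch of what "keys separated from format strings" could mean, with nslots >= nspecs to allow continuations; all names here are illustrative, not from the CEP:)

```c
#include <stddef.h>

/* Illustrative sketch only: fixed-size keys and function pointers are
 * kept together for fast scanning, while the variable-length char*
 * format strings live in a separate array. nslots >= nspecs because a
 * long key may spill over ("continue") into the following slot. */
typedef struct {
    size_t key;     /* encoded signature, or continuation of previous key */
    void  *pointer; /* function pointer for this specialization */
} nativecall_slot;

typedef struct {
    size_t nslots;           /* key/pointer slots, including continuations */
    size_t nspecs;           /* distinct specializations */
    nativecall_slot *slots;  /* nslots entries, scanned by key */
    const char **signatures; /* nspecs entries: full format strings */
} nativecall_table;
```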

OK, this is getting close to my original proposal, but the difference is the continuation char: if you expect a short signature, you can safely scan every slot with no branching and no null-checking necessary.

Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Robert Bradshaw <robertwb at gmail.com> wrote:

On Fri, Apr 13, 2012 at 4:59 AM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
> On 04/13/2012 01:38 PM, Stefan Behnel wrote:
>> Robert Bradshaw, 13.04.2012 12:17:
>>> On Fri, Apr 13, 2012 at 1:52 AM, Dag Sverre Seljebotn wrote:
>>>> On 04/13/2012 01:38 AM, Robert Bradshaw wrote:
>>>>> Have you given any thought as to what happens if __call__ is
>>>>> re-assigned for an object (or subclass of an object) supporting this
>>>>> interface? Or is this out of scope?
>>>> Out-of-scope, I'd say. Though you can always write an object that
>>>> detects if
>>>> you assign to __call__...
>> +1 for out of scope. This is a pure C level feature.
>>>>> Minor nit: I don't think should_dereference is worth branching on, if
>>>>> one wants to save the allocation one can still use a variable-sized
>>>>> type and point to oneself. Yes, that's an extra dereference, but the
>>>>> memory is already likely close and it greatly simplifies the logic.
>>>>> But I could be wrong here.
>>>> Those minor nits are exactly what I seek; since Travis will have the
>>>> first implementation in numba<->SciPy, I just want to make sure that
>>>> what he does will work efficiently with Cython.
>>> +1
>>> I have to admit building/invoking these var-arg-sized __nativecall__
>>> records seems painful. Here's another suggestion:
>>> struct {
>>>     void* pointer;
>>>     size_t signature; // compressed binary representation, 95% coverage
> Once you start passing around functions that take memory view slices as
> arguments, that 95% estimate will be off I think.

We have (on the high-performance systems we care about) 64 bits here.
If we limit ourselves to a 6-bit alphabet, that gives a trivial
encoding for up to 10 chars. We could be more clever here (Huffman
coding), but that might be overkill. More importantly, calls with
"complicated" signatures are likely to be expensive enough that the
strcmp overhead won't matter.
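A sketch of how such a 6-bit packing might work. The alphabet below is made up for illustration; using 1-based symbol codes (0 reserved for "empty") keeps shorter signatures from colliding with longer ones:

```c
#include <stdint.h>
#include <string.h>

/* Made-up 6-bit alphabet covering the signature chars used in the
 * thread ("d)d", "ffi)f", ...). Up to 63 symbols fit in 6 bits with
 * 1-based codes; ten 6-bit codes fit in the low 60 bits of a key. */
static const char ALPHABET[] =
    "abcdefghijklmnopqrstuvwxyz0123456789()-,*ABCDEFGHIJKLMNOPQRSTU";

/* Returns 0 if the signature is too long or uses a char outside the
 * alphabet, so callers can fall back to the char* comparison path. */
static uint64_t encode_sig(const char *sig) {
    size_t n = strlen(sig);
    if (n == 0 || n > 10)
        return 0;
    uint64_t key = 0;
    for (size_t i = 0; i < n; i++) {
        const char *p = strchr(ALPHABET, sig[i]);
        if (p == NULL)
            return 0;
        key = (key << 6) | (uint64_t)(p - ALPHABET + 1); /* 1-based code */
    }
    return key;
}
```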

>>>     char* long_signature; // used if signature is not representable in
>>> a size_t, as indicated by signature = 0
>>> } record;
>>> These char* could optionally be allocated at the end of the record*
>>> for optimal locality. We could even dispense with the binary
>>> signature, but having that option allows us to avoid strcmp for stuff
>>> like d)d and ffi)f.
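A sketch of how a caller might consume such a record: one integer compare on the fast path, with strcmp only for signatures flagged as unrepresentable (the matcher function itself is hypothetical, not part of the proposal):

```c
#include <stddef.h>
#include <string.h>

/* The record proposed above (field names as in the thread). */
typedef struct {
    void *pointer;
    size_t signature;           /* compressed form; 0 = not representable */
    const char *long_signature; /* consulted only when signature == 0 */
} record;

/* Hypothetical matcher: compare the cheap size_t first, fall back to
 * strcmp only when the binary form could not hold the signature. */
static int record_matches(const record *r, size_t want_key,
                          const char *want_sig) {
    if (r->signature != 0)
        return r->signature == want_key;
    return strcmp(r->long_signature, want_sig) == 0;
}
```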
>> Assuming we use literals and a const char* for the signature, the C
>> compiler would cut down the number of signature strings automatically for
>> us. And a pointer comparison is the same as a size_t comparison.
> I'll go one further: Intern Python bytes objects. It's just a PyObject*, but
> it's *required* (or just strongly encouraged) to have gone through
> sig = sys.modules['_nativecall']['interned_db'].setdefault(sig, sig)
> Obviously in a PEP you'd have a C-API function for such interning
> (completely standalone utility). Performance of interning operation itself
> doesn't matter...
> Unless CPython has interning features itself, like in Java? Was that present
> back in the day and then ripped out?
> Requiring interning is somewhat less elegant in one way, but it makes a lot
> of other stuff much simpler.
> That gives us
> struct {
>    void *pointer;
>    PyBytesObject *signature;
> } record;
> and then you allocate a NULL-terminated arrays of these for all the
> overloads.
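The point of requiring interning is that signature comparison collapses to pointer equality. A toy stand-in in C (the real proposal interns Python bytes objects through a standalone C-API utility, which this does not model):

```c
#include <string.h>

#define MAX_INTERNED 64
static const char *interned_db[MAX_INTERNED];
static int ninterned = 0;

/* Toy interning table: returns one canonical pointer per distinct
 * signature string, so equal signatures compare equal by address
 * alone. Assumes the caller keeps the string alive and the table
 * never fills; a real registry would use a hash table. */
static const char *intern_sig(const char *sig) {
    for (int i = 0; i < ninterned; i++)
        if (strcmp(interned_db[i], sig) == 0)
            return interned_db[i];
    interned_db[ninterned] = sig;
    return interned_db[ninterned++];
}
```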

Global interning is a nice idea. The one drawback I see is that it
becomes much more expensive for dynamically calculated signatures.

>> That would only apply at a per-module level, though, so it would require
>> an
>> indirection for the signature IDs. But it would avoid a global registry.
>> Another idea would be to set the signature ID field to 0 at the beginning
>> and call a C-API function to let the current runtime assign an ID > 0,
>> unique for the currently running application. Then every user would only
>> have to parse the signature once to adapt to the respective ID and could
>> otherwise branch based on it directly.
>> For Cython, we could generate a static ID variable for each typed call
>> that
>> we found in the sources. When encountering a C signature on a callable,
>> either a) the ID variable is still empty (initial case), then we parse the
>> signature to see if it matches the expected signature. If it does, we
>> assign the corresponding ID to the static ID variable and issue a direct
>> call. If b) the ID field is already set (normal case), we compare the
>> signature IDs directly and issue a C call if they match. If the IDs do not
>> match, we issue a normal Python call.
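Stefan's scheme might look roughly like this in C; get_runtime_id stands in for the proposed C-API registry function and is invented for illustration:

```c
#include <string.h>

/* Invented stand-in for the runtime registry: assigns each distinct
 * signature string an ID > 0, stable for the running process. */
#define MAX_SIGS 64
static const char *sig_registry[MAX_SIGS];
static long sig_count = 0;

static long get_runtime_id(const char *sig) {
    for (long i = 0; i < sig_count; i++)
        if (strcmp(sig_registry[i], sig) == 0)
            return i + 1;          /* IDs start at 1; 0 means unset */
    sig_registry[sig_count++] = sig;
    return sig_count;
}

/* Per-call-site cache: parse/register the signature once (initial
 * case), then every later check is a plain integer comparison
 * (normal case), as in Stefan's a)/b) description. */
static int ids_match(long *cached_id, const char *expected_sig,
                     long callee_id) {
    if (*cached_id == 0)
        *cached_id = get_runtime_id(expected_sig);
    return *cached_id == callee_id;
}
```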

If I understand correctly, you're proposing

struct {
    char* sig;
    long id;
} sig_t;

Where comparison would (sometimes?) compute id from sig by incrementing
a global counter and consulting a dict? Might be expensive to bootstrap,
but eventually all relevant ids would be filled in and lookups would be
quick. Interesting. I wonder what the performance penalty would be over
assuming id is statically computed most of the time, and using it to
compare against fixed values. And there are memory locality issues as well.

>>>> Right... if we do some work to synchronize the types for Cython modules
>>>> generated by the same version of Cython, we're left with 3-4 types for
>>>> Cython, right? Then a couple for numba and one for f2py; so on the order
>>>> of
>>>> 10?
>>> No, I think each closure is its own type.
>> And that even applies to fused functions, right? They'd have one closure
>> for each type combination.
>>>> An alternative is do something funny in the type object to get across
>>>> the
>>>> offset-in-object information (abusing the docstring, or introduce our
>>>> own
>>>> flag which means that the type object has an additional non-standard
>>>> field
>>>> at the end).
>>> It's a hack, but the flag + non-standard field idea might just work...
>> Plus, it wouldn't have to stay a non-standard field. If it's accepted into
>> CPython 3.4, we could safely use it in all existing versions of CPython.
> Sounds good. Perhaps just use a single "extended" flag, then add a new flag
> field in our payload, in case we need to extend the type object yet again
> later and run out of unused flag bits (TBD: figure out how many unused flag
> bits there are).
> Dag

> cython-devel mailing list
> cython-devel at python.org
> http://mail.python.org/mailman/listinfo/cython-devel

