[Cython] CEP1000: Native dispatch through callables

Tue Apr 17 21:38:41 CEST 2012

On Tue, Apr 17, 2012 at 8:07 PM, Dag Sverre Seljebotn
<d.s.seljebotn at astro.uio.no> wrote:
>
>
> Nathaniel Smith <njs at pobox.com> wrote:
>
>>On Tue, Apr 17, 2012 at 3:34 PM, Dag Sverre Seljebotn
>><d.s.seljebotn at astro.uio.no> wrote:
>>> On 04/17/2012 04:20 PM, Nathaniel Smith wrote:
>>>> Since you've set this up... I have a suggestion for something that may
>>>> be worth trying, though I've hesitated to propose it seriously. And
>>>> that is, an API where instead of scanning a table, the C-callable
>>>> exposes a pointer-to-function, something like
>>>>   int get_funcptr(PyObject * self, PyBytesObject * signature, struct c_function_info * out)
>>>
>>>
>>> Hmm. There's many ways to implement that function though. It shifts the
>>> scanning logic from the caller to the callee;
>>
>>Yes, that's part of the point :-). Or, well, I guess the point is more
>>that it shifts the scanning logic from the ABI docs to the callee.
>
> Well, really it shifts the logic to the getfuncptr argument specification -- is the signature argument an interned string, encoded string, sha1 hash,...
>
> Part of the table storage format is shifted from the CEP but that is so unimportant it has not even been discussed.
>
>>
>>> you would need to call it
>>> multiple times for different signatures...
>>
>>Yes, I'm not sure what I think about this -- there are arguments
>>either way for who should handle promotion. E.g., imagine the
>>following situation:
>>
>>We have a JITable function
>>We have already JITed the int64 version of this function
>>Now we want to call it with an int32
>>Question: should we promote to int64, or should we JIT?
>
> I think we got close to a good solution to this dilemma earlier in this thread:
>
>  - Callers promote scalars to 64 bit if no exact match is found (and JITs only use 64 bit scalars)
>
>  - Arrays and pointers are the real issue. In this case the caller requests another signature (and the JIT kicks in)
>
> The utility of re-jiting for scalars is very limited; it is vital for arrays and pointers.

Nonetheless, I think these rules will run into some trouble (starting
with, JITs only use long double?), and esp. if you want to convince
python-dev of them. But again, I don't think it's so terrible if the
caller just picks some different signatures that it's willing to deal
with, for now.
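
As a rough illustration of a caller that "just picks some different signatures", here is a minimal Python sketch. The one-letter type codes, the `PROMOTE_TO_64` mapping, and both helper names are assumptions made up for this example; nothing here is the CEP's actual encoding.

```python
# Hypothetical one-letter type codes; the real signature encoding was
# still undecided in this thread.
PROMOTE_TO_64 = {"i": "l", "f": "d"}  # int32 -> int64, float32 -> float64


def candidate_signatures(signature):
    """Yield the exact signature first, then its 64-bit scalar promotion."""
    yield signature
    promoted = "".join(PROMOTE_TO_64.get(code, code) for code in signature)
    if promoted != signature:
        yield promoted


def find_in_table(table, signature):
    """Return (matched_signature, function), or None if nothing fits."""
    for candidate in candidate_signatures(signature):
        if candidate in table:
            return candidate, table[candidate]
    return None


# A callee that only offers a 64-bit integer variant.
table = {"ll": lambda x: x * 2}
match = find_in_table(table, "ii")
```

Here the caller asks for the 32-bit signature, finds no exact match, and settles for the 64-bit entry. A signature with no promotion in the mapping simply falls through to `None`.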

>
>>
>>Later you write:
>>> if found in table:
>>>   do dispatch
>>> else if object supports get_funcptr:
>>>   call get_funcptr
>>> else:
>>>   python dispatch
>>
>>If we do promotion during the table scanning, then we'll never call
>>get_funcptr and we'll never JIT an int32 version. OTOH, if we call
>>get_funcptr before doing promotion, then we'll end up calling
>>get_funcptr multiple times for different signatures regardless.
>>
>>OTOOH, there are a *lot* of possible coercions for, say, a 3-argument
>>function with return, so just enumerating them is not necessarily a
>>good strategy. Possibly if get_funcptr can't handle the initial
>>signature, it should return a table of signatures that it *is* willing
>>to handle... assuming that most callees will either be able to handle
>>a fixed set of types (cython variants) or else handle pretty much
>>anything (JIT), and only the former will reach this code path. Or we
>>could write down the allowed promotions (stealing from the C99 spec),
>>and require the callee to pick the best promotion if it can't handle
>>the initial request. Or we could put this part off until version 2,
>>once we see how eager callers are to actually implement a real
>>promotion engine.
>
> I wanted to leave getfuncptr for another CEP.
>
> There's all kinds of stuff -- how does the JIT determine that the argument arrays are large enough to justify JITing? Etc.

I'm sort of inclined to follow KISS here, and say that this isn't
PyPy, we aren't trying to get optimal performance on large, arbitrary
programs. If someone took the trouble to write a function in a special
JIT-able Python subset/dialect and then passed it to C code, it's
because they know that JITing is worth it, and we should just do it
unconditionally. Maybe that'll have to be revised later, but it seems
like a plausible way to get started...

Anyway, get_funcptr alone is actually simpler spec-wise than the array
lookup approach, and the flexibility is an added bonus; it's just a
question of whether it will work.
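
To make the comparison concrete, here is a small Python model of the three-step dispatch quoted above (table, then get_funcptr, then Python dispatch), with a callee that compiles unconditionally on first request. It is only a sketch: `JitCallable`, the `table` attribute, the string signatures, and the rule that a `None` result falls back to a Python call are all assumptions for illustration, and "compiling" is faked by reusing the Python function.

```python
class JitCallable:
    """Toy callee that 'compiles' a variant the first time it is requested."""

    def __init__(self, py_func):
        self.py_func = py_func
        self.compiled = {}  # signature -> specialised function

    def get_funcptr(self, signature):
        # Stand-in for JIT compilation: always compile on first request,
        # with no heuristics about whether it is worth it.
        if signature not in self.compiled:
            self.compiled[signature] = self.py_func
        return self.compiled[signature]

    def __call__(self, *args):
        return self.py_func(*args)


def dispatch(obj, signature, *args):
    """Try the table, then get_funcptr, then an ordinary Python call."""
    table = getattr(obj, "table", {})
    if signature in table:
        return "table", table[signature](*args)
    get_funcptr = getattr(obj, "get_funcptr", None)
    if get_funcptr is not None:
        func = get_funcptr(signature)
        if func is not None:
            return "get_funcptr", func(*args)
    return "python", obj(*args)
```

For example, `dispatch(JitCallable(lambda x: x * x), "ii", 3)` takes the get_funcptr path, while `dispatch(len, "x", [1, 2])` falls through to the plain Python call because `len` offers neither a table nor get_funcptr.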

>>
>>> But if the overhead can be shown to be minuscule then it does perhaps make
>>> the API nicer, even if it feels like paying for nothing at the moment. But
>>> see below.
>>>
>>> Will definitely not get around to this today; anyone else feel free...
>>>
>>>
>>>>
>>>> The rationale is, if we want to support JITed functions where new
>>>> function pointers may be generated on the fly, the array approach has
>>>> a serious problem. You have to decide how many array slots to allocate
>>>> ahead of time, and if you run out, then... too bad. I guess you get to
>>>
>>>
>>> Note that the table is jumped to by a pointer in the PyObject, i.e. the
>>> PyObject I've tested with is
>>>
>>> [object data, &table, table]
>>
>>Oh, I see! I thought you were embedding it in the object, to avoid an
>>extra indirection (and potential cache miss). That's probably
>>necessary, for the reasons you say, but also makes the get_funcptr
>>approach potentially more competitive.
>
> Note that in my benchmark the data was right next to the pointer, I think the cost was minor.

Yeah, I'm not worried about your benchmark; the only case that seems
to really matter is when the cache is cold. Two cache misses are worse
than one.

>>
>>> So a JIT could have the table in a separate location on the heap, then it
>>> can allocate a new table, copy over the contents, and when everything is
>>> ready, then do an atomic pointer update (using the assembly instructions/gcc
>>> intrinsics, not pthreads or locking).
>>>
>>> The old table would need to linger for a bit, but could at latest be
>>> deallocated when the PyObject is deallocated.
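
The copy-then-publish scheme described here can be modelled loosely in Python. This is a stand-in, not the C mechanism: rebinding a single attribute plays the role of the atomic pointer update, and the garbage collector plays the role of letting the old table linger until no reader holds it. The `Callee` class and its method names are hypothetical.

```python
class Callee:
    """Toy object holding a replaceable table of (signature, function) pairs."""

    def __init__(self):
        self.table = ()  # immutable snapshot; never modified in place

    def add_variant(self, signature, func):
        # Build the new table off to the side, copying the old contents,
        # then publish it with a single attribute store.
        new_table = self.table + ((signature, func),)
        self.table = new_table

    def lookup(self, signature):
        snapshot = self.table  # read the pointer once, then scan that snapshot
        for sig, func in snapshot:
            if sig == signature:
                return func
        return None
```

A reader that grabbed the old snapshot before `add_variant` ran keeps scanning a consistent table, and later lookups see the new entry.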
>>
>>IMHO we should just hold the GIL through lookups, which would simplify
>>this, but that's mostly based on the naive intuition that we shouldn't
>>be passing around Python boxes in no-GIL code. Maybe there are good
>>reasons to.
>
> Your intuition about the GIL is wrong as far as Cython is concerned: you are allowed to call cdef 'nogil' methods on refcounted Cython objects without the GIL.

But at least the docs claim that you can't pass a boxed C-callable to
such a method: "If you are implementing such a function in Cython, it
cannot have any Python arguments, ...".

-- Nathaniel