[Python-Dev] C-level duck typing
Mark Shannon
mark at hotpy.org
Wed May 16 17:40:23 CEST 2012
Dag Sverre Seljebotn wrote:
> On 05/16/2012 02:47 PM, Mark Shannon wrote:
>> Stefan Behnel wrote:
>>> Dag Sverre Seljebotn, 16.05.2012 12:48:
>>>> On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
>>>>>> Agreed in general, but in this case, it's really not that easy. A C
>>>>>> function call involves a certain overhead all by itself, so calling
>>>>>> into the C-API multiple times may be substantially more costly than,
>>>>>> say, calling through a function pointer once and then running over a
>>>>>> returned C array comparing numbers. And definitely way more costly
>>>>>> than running over an array that the type struct points to directly.
>>>>>> We are not talking about hundreds of entries here, just a few. A
>>>>>> linear scan in 64-bit steps over something like a hundred bytes in
>>>>>> the L1 cache should hardly be measurable.
>>>>> I give up, then. I fail to understand the problem. Apparently, you
>>>>> want to do something with the value you get from this lookup
>>>>> operation, but that something won't involve function calls (or else
>>>>> the function call overhead for the lookup wouldn't be relevant).
>>>> In our specific case the value would be an offset added to the
>>>> PyObject*, and there we would find a pointer to a C function (together
>>>> with a 64-bit signature), and calling that C function (after checking
>>>> the 64-bit signature) is our final objective.
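
For concreteness, a minimal sketch of the kind of layout being described
(untested; every name below is made up for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: at a fixed offset (known from the type), the object
       stores an entry pairing a 64-bit signature id with a C entry point. */
    typedef struct {
        uint64_t signature;   /* 64-bit encoding of the C signature */
        void    *funcptr;     /* C function implementing it         */
    } native_entry_t;

    /* Caller side: one offset addition, one 64-bit compare, one C call. */
    static double
    call_d_d(void *obj, size_t offset, uint64_t expected_sig, double x)
    {
        native_entry_t *entry = (native_entry_t *)((char *)obj + offset);
        if (entry->signature == expected_sig) {
            double (*f)(double) = (double (*)(double))entry->funcptr;
            return f(x);
        }
        return -1.0;   /* would fall back to a normal Python-level call */
    }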
>>>
>>> I think the use case hasn't been communicated all that clearly yet.
>>> Let's give it another try.
>>>
>>> Imagine we have two sides, one that provides a callable and the other
>>> side that wants to call it. Both sides are implemented in C, so the
>>> callee has a C signature and the caller has the arguments available as
>>> C data types. The signature may or may not match the argument types
>>> exactly (float vs. double, int vs. long, ...), because the caller and
>>> the callee know nothing about each other initially, they just happen
>>> to appear in the same program at runtime. All they know is that they
>>> could call each other through Python space, but that would require
>>> data conversion, tuple packing, calling, tuple unpacking, data
>>> unpacking, and then potentially the same thing on the way back. They
>>> want to avoid that overhead.
>>>
>>> Now, the caller needs to figure out if the callee has a compatible
>>> signature. The callee may provide more than one signature (i.e. more
>>> than one C call entry point), perhaps because it is implemented to
>>> deal with different input data types efficiently, or perhaps because
>>> it can efficiently convert them to its expected input. So, there is a
>>> signature on the caller side given by the argument types it holds, and
>>> a couple of signatures on the callee side that can accept different C
>>> data input. Then the caller needs to find out which signatures there
>>> are and match them against what it can efficiently call. It may even
>>> be a JIT compiler that can generate an efficient call signature on the
>>> fly, given a suitable signature on the callee side.
>>
>>>
>>> An example for this is an algorithm that evaluates a user provided
>>> function on a large NumPy array. The caller knows what array type it
>>> is operating on, and the user provided function may be designed to
>>> efficiently operate on arrays of int, float and double entries.
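
Purely for illustration (untested, all names made up), the callee could
export a small signature table and the caller could scan it once before
the loop over the array, e.g. find_sig(square_sigs, "d->d"):

    #include <string.h>

    /* Hypothetical signature strings a la "d->d" for double -> double. */
    typedef struct {
        const char *sig;
        void       *funcptr;
    } sig_entry_t;

    /* Callee side: a user-provided function, compiled for three dtypes. */
    static double square_d(double x) { return x * x; }
    static float  square_f(float x)  { return x * x; }
    static long   square_l(long x)   { return x * x; }

    /* Function-pointer-to-void* casts are a common-practice shortcut. */
    static const sig_entry_t square_sigs[] = {
        { "d->d", (void *)square_d },
        { "f->f", (void *)square_f },
        { "l->l", (void *)square_l },
        { NULL,   NULL }
    };

    /* Caller side: pick the matching entry once, outside the hot loop. */
    static void *
    find_sig(const sig_entry_t *table, const char *wanted)
    {
        for (; table->sig != NULL; table++)
            if (strcmp(table->sig, wanted) == 0)
                return table->funcptr;
        return NULL;   /* no match: fall back to boxed Python calls */
    }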
>>
>> Given that use case, can I suggest the following:
>>
>> Separate the discovery of the function from its use.
>> By this I mean: first look up the function (outside of the loop),
>> then use the function (inside the loop).
>
> We would obviously do that when we can. But Cython is a compiler/code
> translator, and we don't control use cases. You can easily make up use
> cases (= Cython code people write) where you can't easily separate the
> two.
>
> For instance, the Sage project has hundreds of thousands of lines of
> object-oriented Cython code (NOT just array-oriented, but also graphs
> and trees and stuff), which is all based on Cython's own fast vtable
> dispatches a la C++. They might want to clean up their code and use
> more generic callback objects in some places.
>
> Other users currently pass around C pointers for callback functions,
> and we'd like to tell them "pass around these nicer Python callables
> instead, honestly, the penalty is only 2 ns per call". (*Regardless* of
> how you use them, i.e. without having to make sure you use them in a
> loop where we can statically pull out the function pointer acquisition.
> Saying "this is only non-sluggish if you do x, y, z" puts users off.)
Why not pass around a PyCFunction object instead of a C function
pointer? It contains two fields: the function pointer and the object
(self), which is exactly what you want.

Of course, the PyCFunction object only allows a limited range of
function types, which is why I am suggesting a variant that supports a
wider range of C function pointer types.
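
Roughly, as an untested sketch (METH_O only, minimal error handling),
using only accessors the C-API already has:

    #include <Python.h>

    /* Hoist the extraction out of the loop, then make one indirect C call
       per element -- no argument tuple is built for METH_O functions. */
    static int
    apply_to_array(PyObject *callable, double *data, Py_ssize_t n)
    {
        if (PyCFunction_Check(callable) &&
            PyCFunction_GET_FLAGS(callable) == METH_O)
        {
            PyCFunction meth = PyCFunction_GET_FUNCTION(callable);
            PyObject *self = PyCFunction_GET_SELF(callable);
            Py_ssize_t i;

            for (i = 0; i < n; i++) {
                PyObject *arg = PyFloat_FromDouble(data[i]);
                PyObject *res;
                if (arg == NULL)
                    return -1;
                res = meth(self, arg);   /* call through the extracted pointer */
                Py_DECREF(arg);
                if (res == NULL)
                    return -1;
                data[i] = PyFloat_AsDouble(res);
                Py_DECREF(res);
                if (PyErr_Occurred())
                    return -1;
            }
            return 0;
        }
        return -1;   /* not a METH_O PyCFunction: use the slow path */
    }

The per-element boxing obviously remains; the sketch only shows that the
function pointer and self are already there to be extracted.
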
Is a single extra indirection, obj->func() rather than func(), really
that inefficient?

If you are passing around raw pointers, you have already given up on
dynamic type checking.
>
> I'm not asking you to consider the details of all that. Just to allow
> some kind of high-performance extensibility of PyTypeObject, so that we
> can *stop* bothering python-dev with specific requirements from our
> parallel universe of nearly-all-Cython-and-Fortran-and-C++ codebases :-)
If I read it correctly, you have two problems you wish to solve:
1. A fast callable that can be passed around (see above)
2. Fast access to that callable from a type.
The solution for 2 is the _PyType_Lookup() function.
By the time you have fixed your proposed solution to properly handle
subclassing, I doubt it will be any quicker than _PyType_Lookup().
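
An untested sketch of point 2, with "evaluate" as a hypothetical method
name; note that _PyType_Lookup() returns a *borrowed* reference (or NULL):

    #include <Python.h>

    /* The lookup walks the MRO (with the method cache) exactly once,
       outside the hot loop. */
    static int
    run(PyObject *obj, double *data, Py_ssize_t n)
    {
        PyObject *name = PyUnicode_InternFromString("evaluate");
        PyObject *func;
        Py_ssize_t i;

        if (name == NULL)
            return -1;
        func = _PyType_Lookup(Py_TYPE(obj), name);   /* borrowed, may be NULL */
        Py_DECREF(name);
        if (func == NULL)
            return -1;

        for (i = 0; i < n; i++) {
            /* Unbound lookup, so the instance is passed explicitly.
               Problem 1 (a faster callable) would replace this call. */
            PyObject *res = PyObject_CallFunction(func, "Od", obj, data[i]);
            if (res == NULL)
                return -1;
            data[i] = PyFloat_AsDouble(res);
            Py_DECREF(res);
            if (PyErr_Occurred())
                return -1;
        }
        return 0;
    }
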
Cheers,
Mark.