[Python-Dev] C-level duck typing

Mark Shannon mark at hotpy.org
Wed May 16 18:29:52 CEST 2012


Robert Bradshaw wrote:
> On Wed, May 16, 2012 at 8:40 AM, Mark Shannon <mark at hotpy.org> wrote:
>> Dag Sverre Seljebotn wrote:
>>> On 05/16/2012 02:47 PM, Mark Shannon wrote:
>>>> Stefan Behnel wrote:
>>>>> Dag Sverre Seljebotn, 16.05.2012 12:48:
>>>>>> On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
>>>>>>>> Agreed in general, but in this case, it's really not that easy. A C
>>>>>>>> function call involves a certain overhead all by itself, so calling
>>>>>>>> into
>>>>>>>> the C-API multiple times may be substantially more costly than, say,
>>>>>>>> calling through a function pointer once and then running over a
>>>>>>>> returned C
>>>>>>>> array comparing numbers. And definitely way more costly than
>>>>>>>> running over
>>>>>>>> an array that the type struct points to directly. We are not talking
>>>>>>>> about
>>>>>>>> hundreds of entries here, just a few. A linear scan in 64 bit steps
>>>>>>>> over
>>>>>>>> something like a hundred bytes in the L1 cache should hardly be
>>>>>>>> measurable.
>>>>>>> I give up, then. I fail to understand the problem. Apparently, you
>>>>>>> want
>>>>>>> to do something with the value you get from this lookup operation, but
>>>>>>> that something won't involve function calls (or else the function call
>>>>>>> overhead for the lookup wouldn't be relevant).
>>>>>> In our specific case the value would be an offset added to the
>>>>>> PyObject*,
>>>>>> and there we would find a pointer to a C function (together with a
>>>>>> 64-bit
>>>>>> signature), and calling that C function (after checking the 64 bit
>>>>>> signature) is our final objective.
>>>>>
>>>>> I think the use case hasn't been communicated all that clearly yet.
>>>>> Let's
>>>>> give it another try.
>>>>>
>>>>> Imagine we have two sides, one that provides a callable and the other
>>>>> side
>>>>> that wants to call it. Both sides are implemented in C, so the callee
>>>>> has a
>>>>> C signature and the caller has the arguments available as C data
>>>>> types. The
>>>>> signature may or may not match the argument types exactly (float vs.
>>>>> double, int vs. long, ...), because the caller and the callee know
>>>>> nothing
>>>>> about each other initially; they just happen to appear in the same
>>>>> program
>>>>> at runtime. All they know is that they could call each other through
>>>>> Python
>>>>> space, but that would require data conversion, tuple packing, calling,
>>>>> tuple unpacking, data unpacking, and then potentially the same thing
>>>>> on the
>>>>> way back. They want to avoid that overhead.
>>>>>
>>>>> Now, the caller needs to figure out if the callee has a compatible
>>>>> signature. The callee may provide more than one signature (i.e. more
>>>>> than
>>>>> one C call entry point), perhaps because it is implemented to deal with
>>>>> different input data types efficiently, or perhaps because it can
>>>>> efficiently convert them to its expected input. So, there is a
>>>>> signature on
>>>>> the caller side given by the argument types it holds, and a couple of
>>>>> signatures on the callee side that can accept different C data input.
>>>>> Then
>>>>> the caller needs to find out which signatures there are and match them
>>>>> against what it can efficiently call. It may even be a JIT compiler that
>>>>> can generate an efficient call signature on the fly, given a suitable
>>>>> signature on callee side.
>>>>
>>>>> An example for this is an algorithm that evaluates a user provided
>>>>> function
>>>>> on a large NumPy array. The caller knows what array type it is operating
>>>>> on, and the user provided function may be designed to efficiently
>>>>> operate
>>>>> on arrays of int, float and double entries.
>>>>
>>>> Given that use case, can I suggest the following:
>>>>
>>>> Separate the discovery of the function from its use.
>>>> By this I mean: first look up the function (outside the loop),
>>>> then use the function (inside the loop).
>>>
>>> We would obviously do that when we can. But Cython is a compiler/code
>>> translator, and we don't control use cases. You can easily make up use cases
>>> (= Cython code people write) where you can't easily separate the two.
>>>
>>> For instance, the Sage project has hundreds of thousands of lines of
>>> object-oriented Cython code (NOT just array-oriented, but also graphs and
>>> trees and stuff), which is all based on Cython's own fast vtable dispatch
>>> a la C++. They might want to clean up their code and use more generic
>>> callback objects in some places.
>>>
>>> Other users currently pass around C pointers for callback functions, and
>>> we'd like to tell them "pass around these nicer Python callables instead,
>>> honestly, the penalty is only 2 ns per call". (*Regardless* of how you use
>>> them, like making sure you use them in a loop where we can statically pull
>>> out the function pointer acquisition. Saying "this is only non-sluggish if
>>> you do x, y, z" puts users off.)
>>
>> Why not pass around a PyCFunction object instead of a C function
>> pointer? It contains two fields: the function pointer and the object (self),
>> which is exactly what you want.
>>
>> Of course, the PyCFunction object only allows a limited range of
>> function types, which is why I am suggesting a variant which supports a
>> wider range of C function pointer types.
>>
>> Is a single extra indirection in obj->func() rather than func()
>> really that inefficient?
>> If you are passing around raw pointers, you have already given up on
>> dynamic type checking.
>>
>>
>>> I'm not asking you to consider the details of all that. Just to allow some
>>> kind of high-performance extensibility of PyTypeObject, so that we can
>>> *stop* bothering python-dev with specific requirements from our parallel
>>> universe of nearly-all-Cython-and-Fortran-and-C++ codebases :-)
>>
>> If I read it correctly, you have two problems you wish to solve:
>> 1. A fast callable that can be passed around (see above)
>> 2. Fast access to that callable from a type.
>>
>> The solution for 2. is the _PyType_Lookup() function.
>> By the time you have fixed your proposed solution to properly handle
>> subclassing, I doubt it will be any quicker than _PyType_Lookup().
> 
> It is certainly (2) that we are most interested in solving here; (1)
> can be solved in a variety of ways. For this second point, we're
> looking for something that's faster than a dictionary lookup. (For
> example, a common use case is user-provided functions operating on C
> doubles, which can be quite fast.)

_PyType_Lookup() is fast; it doesn't perform any dictionary lookups if 
the (type, attribute) pair is in the cache.

> 
> The PyTypeObject struct is in large part a list of methods that were
> deemed too common and time-critical to merit the dictionary lookup
> (and Python call) overhead. Unfortunately, it's not extensible. We
> figured it'd be useful to get any feedback from the large Python
> community on how best to add extensibility, in particular with an eye
> for being future-proof and possibly an official part of the standard
> for some future version of Python.

I don't see any problem with making _PyType_Lookup() public.
But others might disagree.

Cheers,
Mark.

