C-level duck typing
Hi python-dev,

these ideas/questions come out of the Cython and NumPy developer lists. What we want is a way to communicate things on the C level about the extension type instances we pass around. The solution today is often to rely on PyObject_TypeCheck. For instance, hundreds of handcrafted C extensions rely on the internal structure of NumPy arrays, and Cython will check whether objects are instances of a Cython class or not. However, this creates a one-to-many situation: only one implementor of an object API/ABI, but many consumers. What we would like is multiple implementors and multiple consumers of mutually agreed-upon standards. We essentially want more duck typing on the C level.

A similar situation was addressed by PEP 3118. But there are many more such things one might want to communicate at the C level, many of which are very domain-specific and not suitable for a PEP at all. Also, PEPs don't backport well to older versions of Python.

What we *think* we would like (but we want other suggestions!) is an arbitrarily extensible type object, without tying this into the type hierarchy. Say you have

    typedef struct {
        unsigned long extension_id;
        void *data;
    } PyTypeObjectExtensionEntry;

and then a type object can (somehow!) point to an array of these. The array is linearly scanned by consumers for IDs they recognize (most types would only have one or two entries). Cython could then get a reserved ID space to communicate whatever it wants, NumPy another one, and there could be "unofficial PEPs" where two or more projects get together to draft a spec for a particular type extension ID without having to bother python-dev about it. And, we want this to somehow work with existing Python; we still support users on Python 2.4.

Options we've thought of so far:

a) Use dicts and capsules to get information across. But performance-wise the dict lookup is not an option for what we want to use this for in Cython.

b) Implement a metaclass which extends PyTypeObject in this way. However, that means a common runtime dependency for libraries that want to use this scheme, which is a big disadvantage to us. Today, Cython doesn't ship a runtime library but only creates standalone compilable C files, and there's no dependency from NumPy on Cython or the other way around.

c) Hijack a free bit in tp_flags (22?) which we use to indicate that the PyTypeObject struct is immediately followed by a pointer to such an array.

The final approach is drafted in more detail at http://wiki.cython.org/enhancements/cep1001 . To us that looks very attractive both for the speed and for the lack of runtime dependencies, and it seems like it should work in existing versions of Python. But do please feel free to tell us we are misguided. Hijacking a flag bit certainly feels dirty.

Examples of how this would be used:

- In Cython, we'd like to use this to annotate callable objects that happen to wrap a C function with their corresponding C function pointers. That way, callables that wrap a C function could be "unboxed", so that Cython could "cast" the Python object "scipy.special.gamma" to a function pointer at runtime and speed up the call by an order of magnitude. SciPy and Cython just need to agree on a spec.

- Lots of C extensions rely on using PyObject_TypeCheck (or even do an exact check) before calling the NumPy C API with PyArrayObject* arguments. This means that new features all have to go into NumPy; it is rather difficult to create new experimental array libraries.
An extensible PyTypeObject would open up the way for other experimental array libraries; NumPy could set the standards, but others could implement them (without getting NumPy as a runtime dependency, which is the consequence of subclassing). Of course, porting over the hundreds (thousands?) of extensions relying on the NumPy C API is a lot of work, but we can at least get started... Ideas? Dag Sverre Seljebotn
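For concreteness, here is a minimal sketch of what the consumer side of option (c) might look like. Everything specific in it is an assumption made for illustration only: the flag bit value, the zero-terminated array convention, and the detail that the array pointer is stored immediately after the PyTypeObject struct; the CEP linked above would be the authoritative layout.

    #include <Python.h>

    #define Py_TPFLAGS_HAS_EXTENSION_TABLE (1UL << 22)   /* hypothetical bit */

    typedef struct {
        unsigned long extension_id;
        void *data;
    } PyTypeObjectExtensionEntry;

    static void *
    lookup_type_extension(PyTypeObject *tp, unsigned long wanted_id)
    {
        PyTypeObjectExtensionEntry *entry;

        if (!(tp->tp_flags & Py_TPFLAGS_HAS_EXTENSION_TABLE))
            return NULL;
        /* assumed convention: a pointer to a zero-terminated entry array
           is stored immediately after the PyTypeObject struct itself */
        entry = *(PyTypeObjectExtensionEntry **)((char *)tp + sizeof(PyTypeObject));
        for (; entry->extension_id != 0; entry++) {
            if (entry->extension_id == wanted_id)
                return entry->data;
        }
        return NULL;
    }

A producer would set the flag bit and append the pointer when defining its type; a consumer that does not recognize any ID would simply fall back to the normal Python-level protocols.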
Use PyObject_HasAttr, just as people use hasattr() for ducktyping in Python. If you want something more structured, use Abstract Base Classes, that's what they're for. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 05/16/2012 10:11 AM, Nick Coghlan wrote:
Use PyObject_HasAttr, just as people use hasattr() for ducktyping in Python.
In the Cython wrap-function-pointers case we really want performance comparable to C, so we couldn't go through the full attribute lookup. But I guess we could intern some char* (somehow), pass that to tp_getattr, and then cast the returned PyObject* (which would not actually be a PyObject*) to a custom struct. As long as the interned strings can never be reached from Python, that's almost safe. It's still a slight abuse of tp_getattr. As I said, if we didn't worry about performance we'd just retrieve capsules through attributes. Dag
If you want something more structured, use Abstract Base Classes, that's what they're for.
Cheers, Nick.
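A rough sketch of the tp_getattr trick Dag describes above, assuming a callee type that implements tp_getattr and recognizes a magic key by pointer identity. The struct layout and key name are invented, and the returned pointer is deliberately not a real PyObject*, which is why this is only "almost safe".

    #include <Python.h>

    /* hypothetical payload the callee hands back instead of a PyObject* */
    typedef struct {
        void *funcptr;                  /* the wrapped C function */
        unsigned long long signature;   /* its 64-bit encoded signature */
    } hypothetical_cfunc_info;

    /* key never exposed to Python code; recognized by pointer identity */
    static char cython_nativecall_key[] = "__cython_nativecall_key__";

    static hypothetical_cfunc_info *
    get_cfunc_info(PyObject *obj)
    {
        getattrfunc f = Py_TYPE(obj)->tp_getattr;
        if (f == NULL)
            return NULL;
        /* the callee's tp_getattr checks for the magic key and returns a
           pointer to its own info struct; a type that does not recognize
           the key would instead set AttributeError and return NULL, which
           the caller would have to clear */
        return (hypothetical_cfunc_info *)f(obj, cython_nativecall_key);
    }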
And, we want this to somehow work with existing Python; we still support users on Python 2.4.
This makes the question out-of-scope for python-dev - we only discuss new versions of Python here. Old versions cannot be developed anymore (as they are released already).
typedef struct { unsigned long extension_id; void *data; } PyTypeObjectExtensionEntry;
and then a type object can (somehow!) point to an array of these. The array is linearly scanned
It's unclear to me why you think that a linear scan is faster than a dictionary lookup. The contrary will be the case - the dictionary lookup (PyObject_GetAttr) will be much faster. Just make sure to use interned strings, and to cache the interned strings. Regards, Martin
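A small sketch of the interned-string route Martin suggests, using only public C-API calls (PyUnicode_InternFromString on Python 3). The attribute name and the idea of returning a capsule are made up for the example.

    #include <Python.h>

    static PyObject *cached_name = NULL;   /* interned once, then reused */

    static PyObject *
    get_capability(PyObject *obj)
    {
        if (cached_name == NULL) {
            cached_name = PyUnicode_InternFromString("_c_capability_table");
            if (cached_name == NULL)
                return NULL;
        }
        /* one dictionary lookup; the result could e.g. be a PyCapsule
           wrapping a C struct agreed on by producer and consumer */
        return PyObject_GetAttr(obj, cached_name);
    }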
On 05/16/2012 10:36 AM, "Martin v. Löwis" wrote:
And, we want this to somehow work with existing Python; we still support users on Python 2.4.
This makes the question out-of-scope for python-dev - we only discuss new versions of Python here. Old versions cannot be developed anymore (as they are released already).
Point taken. Sorry about that, and I appreciate your patience with me. I guess my idea was that if some mechanism was approved for future Python versions, we would feel easier about hacking around older Python versions. Of course, nothing is better than this not being a problem, as you seem to suggest. But:
typedef struct { unsigned long extension_id; void *data; } PyTypeObjectExtensionEntry;
and then a type object can (somehow!) point to an array of these. The array is linearly scanned
It's unclear to me why you think that a linear scan is faster than a dictionary lookup. The contrary will be the case - the dictionary lookup (PyObject_GetAttr) will be much faster.
I've benchmarked using a PyObject* as a function-pointer-capsule using the above mechanism; that added about 2-3 nanoseconds of overhead on what would be a 5 nanosecond call in pure C. (There will only be 1 or 2 entries in that list...)

Dict lookups are about 18 nanoseconds for me, using interned string objects (see below). Perhaps that can be reduced somewhat, but I highly doubt you'll get down to 3-4 nanoseconds?

Cython benchmark (which does translate to what you'd do in C):

    def hammer_dict(int n):
        cdef dict the_dict
        a = "hello"
        b = "there"
        the_dict = {a : a, b : a}
        for i in range(n):
            the_dict[b]

Dag
"Martin v. Löwis", 16.05.2012 10:36:
And, we want this to somehow work with existing Python; we still support users on Python 2.4.
This makes the question out-of-scope for python-dev - we only discuss new versions of Python here. Old versions cannot be developed anymore (as they are released already).
Well, it's in scope because CPython would have to support this in a future version, or at least have to make sure it knows about it so that it can stay out of the way (depending on how the solution eventually ends up working). We're also very much interested in input from the CPython core developers regarding the design, because we think that this should become a general feature of the Python platform (potentially also for PyPy etc.). The fact that we need to support it in older CPython versions is also relevant, because the solution we choose shouldn't conflict with older versions. The fact that they are no longer developed actually helps, because it will prevent them from interfering in the future any more than they do now.
typedef struct { unsigned long extension_id; void *data; } PyTypeObjectExtensionEntry;
and then a type object can (somehow!) point to an array of these. The array is linearly scanned
It's unclear to me why you think that a linear scan is faster than a dictionary lookup. The contrary will be the case - the dictionary lookup (PyObject_GetAttr) will be much faster.
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable. This might sound like a premature micro optimisation, but these things can quickly add up, e.g. when running a user provided function over a large array. (And, yes, we'd try to do caching and all that...) Stefan
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.
I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant). I still think this is out of scope for python-dev. If this is something you want to be able to do for Python 2.4 as well, then you don't need any change to Python - you can do whatever you come up with for all Python versions, no need to (or point in) changing Python 3.4 (say). As this is apparently only relevant to speed fanatics, too, I suggest that you check how fast PyPI works for you. Supposedly, they have very efficient lookup procedures, supported by the JIT. If this doesn't work for some reason, I suggest that you'll have to trade speed for convenience: a compile-time fixed layout will beat any dynamic lookup any time. Just define a common base class, and have all interesting types inherit from it. Regards, Martin
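A sketch of the common-base-class approach Martin suggests: a shared extension type that carries the C-level fields at a fixed, compile-time-known offset, so consumers only need PyObject_TypeCheck plus a cast. All names here are made up; the point is the fixed layout.

    #include <Python.h>

    typedef struct {
        PyObject_HEAD
        void *c_func;                    /* the C entry point */
        unsigned long long c_signature;  /* its encoded signature */
    } NativeCallableObject;

    static PyTypeObject NativeCallable_Type = {
        PyVarObject_HEAD_INIT(NULL, 0)
        "nativecallable",                 /* tp_name */
        sizeof(NativeCallableObject),     /* tp_basicsize */
        /* remaining slots filled in as usual before PyType_Ready() */
    };

    /* consumer side: one type check, then direct struct access */
    static void *
    get_native_func(PyObject *obj, unsigned long long expected_sig)
    {
        if (PyObject_TypeCheck(obj, &NativeCallable_Type)) {
            NativeCallableObject *nc = (NativeCallableObject *)obj;
            if (nc->c_signature == expected_sig)
                return nc->c_func;
        }
        return NULL;
    }

The drawback discussed in this thread is that every library wanting to participate then needs this one base type as a common runtime dependency.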
On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.
I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).
In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
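A sketch of that call path, with the offset and signature conventions and all names invented for illustration: one pointer addition, one 64-bit compare, then a direct C call.

    #include <Python.h>

    typedef double (*dd_func)(double);   /* example: double -> double */

    /* hypothetical layout found at the agreed offset inside the callable */
    typedef struct {
        unsigned long long signature;    /* 64-bit encoded "double (double)" */
        void *funcptr;
    } nativecall_entry;

    static int
    fast_call(PyObject *callable, size_t offset,
              unsigned long long expected_sig, double arg, double *result)
    {
        nativecall_entry *e = (nativecall_entry *)((char *)callable + offset);
        if (e->signature != expected_sig)
            return 0;                    /* caller falls back to tp_call */
        *result = ((dd_func)e->funcptr)(arg);   /* the actual fast path */
        return 1;
    }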
I still think this is out of scope for python-dev. If this is something you want to be able to do for Python 2.4 as well, then you don't need any change to Python - you can do whatever you come up with for all Python versions, no need to (or point in) changing Python 3.4 (say).
We can go ahead and hijack tp_flags bit 22 to make things work in existing versions. But what if Python 3.8 then starts using that bit for something else?
As this is apparently only relevant to speed fanatics, too, I suggest that you check how fast PyPI works for you. Supposedly, they have very efficient lookup procedures, supported by the JIT. If this doesn't work for some reason, I suggest that you'll have to trade speed for convenience: a compile-time fixed layout will beat any dynamic lookup any time. Just define a common base class, and have all interesting types inherit from it.
Did you mean PyPy? Stefan and I are Cython developers, so that's kind of our angle... And I'm a Cython developer because it solves a practical need (in my case in scientific computation), not because I think it's that beautiful. PyPy won't work for me (let's not go down that road now...) Defining a common base class is what NumPy already does, and Cython would be forced to do the same without this proposal. We just think it has significant disadvantages and were looking for something else. Dag
Dag Sverre Seljebotn, 16.05.2012 12:48:
On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.
I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).
In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
I think the use case hasn't been communicated all that clearly yet. Let's give it another try.

Imagine we have two sides, one that provides a callable and the other side that wants to call it. Both sides are implemented in C, so the callee has a C signature and the caller has the arguments available as C data types. The signature may or may not match the argument types exactly (float vs. double, int vs. long, ...), because the caller and the callee know nothing about each other initially, they just happen to appear in the same program at runtime. All they know is that they could call each other through Python space, but that would require data conversion, tuple packing, calling, tuple unpacking, data unpacking, and then potentially the same thing on the way back. They want to avoid that overhead.

Now, the caller needs to figure out if the callee has a compatible signature. The callee may provide more than one signature (i.e. more than one C call entry point), perhaps because it is implemented to deal with different input data types efficiently, or perhaps because it can efficiently convert them to its expected input. So, there is a signature on the caller side given by the argument types it holds, and a couple of signatures on the callee side that can accept different C data input. Then the caller needs to find out which signatures there are and match them against what it can efficiently call. It may even be a JIT compiler that can generate an efficient call signature on the fly, given a suitable signature on the callee side.

An example for this is an algorithm that evaluates a user provided function on a large NumPy array. The caller knows what array type it is operating on, and the user provided function may be designed to efficiently operate on arrays of int, float and double entries.

Does this use case make sense to everyone?

The reason why we are discussing this on python-dev is that we are looking for a general way to expose these C level signatures within the Python ecosystem. And Dag's idea was to expose them as part of the type object, basically as an addition to the current Python level tp_call() slot.

Stefan
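To illustrate the matching step, here is a sketch using human-readable signature strings; a real spec would more likely use interned IDs or the 64-bit encoded signatures mentioned earlier, and the table layout and names are made up.

    #include <string.h>

    typedef struct {
        const char *signature;   /* e.g. "d->d", "f->f", "i->i" */
        void *funcptr;
    } callee_signature;

    /* a callee might export entry points for several argument types */
    static double gamma_d(double x) { return x; }   /* stand-in body */
    static float  gamma_f(float x)  { return x; }   /* stand-in body */

    static callee_signature gamma_signatures[] = {
        { "d->d", (void *)gamma_d },
        { "f->f", (void *)gamma_f },
        { NULL, NULL }
    };

    /* caller side: pick the first entry point it knows how to call */
    static void *
    find_matching_entry(callee_signature *table, const char *wanted)
    {
        for (; table->signature != NULL; table++) {
            if (strcmp(table->signature, wanted) == 0)
                return table->funcptr;
        }
        return NULL;   /* no match: fall back to the normal Python call */
    }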
Stefan Behnel, 16.05.2012 13:13:
Dag Sverre Seljebotn, 16.05.2012 12:48:
On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.
I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).
In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
I think the use case hasn't been communicated all that clearly yet. Let's give it another try.
Imagine we have two sides, one that provides a callable and the other side that wants to call it. Both sides are implemented in C, so the callee has a C signature and the caller has the arguments available as C data types. The signature may or may not match the argument types exactly (float vs. double, int vs. long, ...), because the caller and the callee know nothing about each other initially, they just happen to appear in the same program at runtime. All they know is that they could call each other through Python space, but that would require data conversion, tuple packing, calling, tuple unpacking, data unpacking, and then potentially the same thing on the way back. They want to avoid that overhead.
Now, the caller needs to figure out if the callee has a compatible signature. The callee may provide more than one signature (i.e. more than one C call entry point), perhaps because it is implemented to deal with different input data types efficiently, or perhaps because it can efficiently convert them to its expected input. So, there is a signature on the caller side given by the argument types it holds, and a couple of signature on the callee side that can accept different C data input. Then the caller needs to find out which signatures there are and match them against what it can efficiently call. It may even be a JIT compiler that can generate an efficient call signature on the fly, given a suitable signature on callee side.
An example for this is an algorithm that evaluates a user provided function on a large NumPy array. The caller knows what array type it is operating on, and the user provided function may be designed to efficiently operate on arrays of int, float and double entries.
Does this use case make sense to everyone?
The reason why we are discussing this on python-dev is that we are looking for a general way to expose these C level signatures within the Python ecosystem. And Dag's idea was to expose them as part of the type object, basically as an addition to the current Python level tp_call() slot.
... and to finish the loop that I started here (sorry for being verbose): The proposal that Dag referenced describes a more generic way to make this kind of extension to type objects from user code. Basically, it allows implementers to say "my type object has capability X", in a C-ish kind of way. And the above C signature protocol would be one of those capabilities. Personally, I wouldn't mind making the specific signature extension a proposal instead of asking for a general extension mechanism for arbitrary capabilities (although that still sounds tempting). Stefan
On 05/16/2012 02:16 PM, Stefan Behnel wrote:
Stefan Behnel, 16.05.2012 13:13:
Dag Sverre Seljebotn, 16.05.2012 12:48:
On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.
I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).
In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
I think the use case hasn't been communicated all that clearly yet. Let's give it another try.
Imagine we have two sides, one that provides a callable and the other side that wants to call it. Both sides are implemented in C, so the callee has a C signature and the caller has the arguments available as C data types. The signature may or may not match the argument types exactly (float vs. double, int vs. long, ...), because the caller and the callee know nothing about each other initially, they just happen to appear in the same program at runtime. All they know is that they could call each other through Python space, but that would require data conversion, tuple packing, calling, tuple unpacking, data unpacking, and then potentially the same thing on the way back. They want to avoid that overhead.
Now, the caller needs to figure out if the callee has a compatible signature. The callee may provide more than one signature (i.e. more than one C call entry point), perhaps because it is implemented to deal with different input data types efficiently, or perhaps because it can efficiently convert them to its expected input. So, there is a signature on the caller side given by the argument types it holds, and a couple of signature on the callee side that can accept different C data input. Then the caller needs to find out which signatures there are and match them against what it can efficiently call. It may even be a JIT compiler that can generate an efficient call signature on the fly, given a suitable signature on callee side.
An example for this is an algorithm that evaluates a user provided function on a large NumPy array. The caller knows what array type it is operating on, and the user provided function may be designed to efficiently operate on arrays of int, float and double entries.
Does this use case make sense to everyone?
The reason why we are discussing this on python-dev is that we are looking for a general way to expose these C level signatures within the Python ecosystem. And Dag's idea was to expose them as part of the type object, basically as an addition to the current Python level tp_call() slot.
... and to finish the loop that I started here (sorry for being verbose):
The proposal that Dag referenced describes a more generic way to make this kind of extension to type objects from user code. Basically, it allows implementers to say "my type object has capability X", in a C-ish kind of way. And the above C signature protocol would be one of those capabilities.
Personally, I wouldn't mind making the specific signature extension a proposal instead of asking for a general extension mechanism for arbitrary capabilities (although that still sounds tempting).
Here are some reasons for the generic proposal:

a) Avoid premature PEP-ing. Look at PEP 3118 for instance; that would almost certainly have been better if there had been a few years of beta-testing in the wild among Cython and NumPy users. I think PEP-ing the "nativecall" proposal soon (even in the unlikely event that it would be accepted) is bound to give suboptimal results -- it needs to be tested in the wild on Cython and SciPy users for a few years first. (Still, we can't ask those to recompile their Python.) My proposal is then about allowing people to play with their own slots, and deploy that to users, without having to create a PEP for their specific usecase.

b) There's more than the "nativecall" we'd use this for in Cython. Something like compiled abstract base classes/compiled multiple inheritance/Go-style interfaces, for instance. Some of those things we'd like to use it for certainly will never be a PEP.

c) Get NumPy users off their PyObject_TypeCheck habit, which IMO is damaging to the NumPy project (because you can't easily play around with different array libraries and new ideas -- NumPy is the only array type you can ever have, because millions of code lines have been written using its C API. My proposal provides a way of moving that API over to accept any object implementing a NumPy-specified spec. We certainly don't want a 20 nanosecond speed regression on every single call made to the NumPy C API, and you simply don't rewrite millions of code lines.)

I think having millions of lines of "Python" code written in C, and not Python, and considering 20 nanoseconds as "much", is perhaps not the typical usecase on this list. Still, that's the world of scientific computing with Python. Python-the-interpreter is just the "shell" around the real stuff that all happens in C or Fortran. (Cython is not just about scientific computing, as I'm sure Stefan has told you all about. But in other situations I think there's less of a need for "cross-talk" between extensions without going through the Python API.)

I guess I don't get "if something needs to be fast on the C level, then that one specific usecase should be in a PEP". And all we're asking for is really that one bit in tp_flags.

Dag
Dag wrote:
I'm not asking you to consider the details of all that. Just to allow some kind of high-performance extensibility of PyTypeObject, so that we can *stop* bothering python-dev with specific requirements from our parallel universe of nearly-all-Cython-and-Fortran-and-C++ codebases :-)
Maybe you should ask for a bit more than that. Tp_flags bits are a scarce resource, and they shouldn't be handed out lightly to anyone who asks for one. Eventually they're going to run out, and then something else will have to be done to make further extensions of the type object possible. So maybe it's worth thinking about making a general mechanism available for third parties to extend the type object without them all needing to have their own tp_flags bits and without needing to collude with each other to avoid conflicts. -- Greg
On Wed, May 16, 2012 at 5:03 PM, Greg Ewing
Dag wrote:
I'm not asking you to consider the details of all that. Just to allow some kind of high-performance extensibility of PyTypeObject, so that we can *stop* bothering python-dev with specific requirements from our parallel universe of nearly-all-Cython-and-Fortran-and-C++ codebases :-)
Maybe you should ask for a bit more than that.
Tp_flags bits are a scarce resource, and they shouldn't be handed out lightly to anyone who asks for one. Eventually they're going to run out, and then something else will have to be done to make further extensions of the type object possible.
So maybe it's worth thinking about making a general mechanism available for third parties to extend the type object without them all needing to have their own tp_flags bits and without needing to collude with each other to avoid conflicts.
This is exactly what was proposed to start this thread (with minimal collusion to avoid conflicts, specifically partitioning up a global ID space). - Robert
On 17/05/12 12:17, Robert Bradshaw wrote:
This is exactly what was proposed to start this thread (with minimal collusion to avoid conflicts, specifically partitioning up a global ID space).
Yes, but I think this part of the mechanism needs to be spelled out in more detail, perhaps in the form of a draft PEP. Then there will be something concrete to discuss in python-dev. -- Greg
On 05/17/2012 05:00 AM, Greg Ewing wrote:
On 17/05/12 12:17, Robert Bradshaw wrote:
This is exactly what was proposed to start this thread (with minimal collusion to avoid conflicts, specifically partitioning up a global ID space).
Yes, but I think this part of the mechanism needs to be spelled out in more detail, perhaps in the form of a draft PEP. Then there will be something concrete to discuss in python-dev.
Well, we weren't 100% sure what is the best mechanism, so the point really was to solicit input, even if I got a bit argumentative along the way. Thanks to all of you!

If we in the end decide that we would like to propose a PEP, does anyone feel the odds are anything but very, very slim? I don't think I've heard a single positive word about the proposal so far except from Cython devs, so I'm reluctant to spend my own and your time on fleshing out a full PEP for that reason.

In a PEP, the proposal would likely be an additional pointer to a table of "custom PyTypeObject extensions", not a flag bit. The whole point would be to only do that once, and after that PyTypeObject would be infinitely extensible for custom purposes without collisions (even as a way of pre-testing PEPs about PyTypeObject in the wild before final approval!). Of course, one pointer more per type object is a bigger burden to push on others.

The thing is, you *can* just use a subtype of PyType_Type for this purpose (or any purpose); it's just my opinion that it's not the best solution here, since it means many different libraries need a common dependency for this reason alone (or have to dynamically handshake on a base class at runtime). You could just stick that base class in CPython, which would be OK I guess but not great (using the type hierarchy is quite intrusive in general; you didn't subclass PyType_Type to stick in tp_as_buffer either).

Dag
I think the main things we'd be looking for would be:

- a clear explanation of why a new metaclass is considered too complex a solution
- what the implications are for classes that have nothing to do with the SciPy/NumPy ecosystem
- how subclassing would behave (both at the class and metaclass level)

Yes, defining a new metaclass for fast signature exchange has its challenges - but it means that *our* concerns about maintaining consistent behaviour in the default object model and avoiding adverse effects on code that doesn't need the new behaviour are addressed automatically.

Also, I'd consider a functioning reference implementation using a custom metaclass a requirement before we considered modifying type anyway, so I think that's the best thing to pursue next rather than a PEP. It also has the virtue of letting you choose which Python versions to target and iterating at a faster rate than CPython.

Cheers, Nick. -- Sent from my phone, thus the relative brevity :)
On 05/18/2012 12:57 AM, Nick Coghlan wrote:
I think the main things we'd be looking for would be:

- a clear explanation of why a new metaclass is considered too complex a solution
- what the implications are for classes that have nothing to do with the SciPy/NumPy ecosystem
- how subclassing would behave (both at the class and metaclass level)
Yes, defining a new metaclass for fast signature exchange has its challenges - but it means that *our* concerns about maintaining consistent behaviour in the default object model and avoiding adverse effects on code that doesn't need the new behaviour are addressed automatically.
Also, I'd consider a functioning reference implementation using a custom metaclass a requirement before we considered modifying type anyway, so I think that's the best thing to pursue next rather than a PEP. It also has the virtue of letting you choose which Python versions to target and iterating at a faster rate than CPython.
This seems right on target. I could make a utility C header for such a metaclass, and then the different libraries can all include it and handshake on which implementation becomes the real one through sys.modules during module initialization. That way an eventual PEP will only be a natural incremental step to make things more polished, whether that happens by making such a metaclass part of the standard library or by extending PyTypeObject. Thanks, Dag
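A very rough sketch of what such a shared header might contain, following Nick's metaclass suggestion. All names are placeholders, and the real header would also need the sys.modules handshake mentioned above so that every library ends up using the same metaclass instance.

    #include <Python.h>

    typedef struct {
        unsigned long extension_id;
        void *data;
    } extension_entry;

    typedef struct {
        PyHeapTypeObject base;          /* normal heap type layout */
        extension_entry *extensions;    /* extra slot carried by the metaclass */
    } ExtensibleTypeObject;

    static PyTypeObject ExtensibleType_Type = {
        PyVarObject_HEAD_INIT(NULL, 0)
        "extensibletype",               /* tp_name */
        sizeof(ExtensibleTypeObject),   /* tp_basicsize */
        /* remaining slots as for any metaclass deriving from type;
           tp_base would be set to &PyType_Type before PyType_Ready() */
    };

    /* consumer side: check the metaclass instead of a tp_flags bit */
    static void *
    lookup_extension(PyTypeObject *tp, unsigned long wanted_id)
    {
        extension_entry *e;
        if (!PyObject_TypeCheck((PyObject *)tp, &ExtensibleType_Type))
            return NULL;
        for (e = ((ExtensibleTypeObject *)tp)->extensions; e && e->extension_id; e++)
            if (e->extension_id == wanted_id)
                return e->data;
        return NULL;
    }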
If we in the end decide that we would like to propose a PEP, does anyone feel the odds are anything but very, very slim? I don't think I've heard a single positive word about the proposal so far except from Cython devs, so I'm reluctant to spend my own and your time on fleshing out a full PEP for that reason.
Before you do that, it might be useful to publish a precise, reproducible, complete benchmark first, to support the performance figures you have been quoting. I'm skeptical by nature, so I don't believe any of the numbers you have given until I can reproduce them myself. More precisely, I fail to understand what they mean without seeing the source code that produced them (perhaps along with an indication what hardware, operating system, compiler version, and Python version were used to produce them). Regards, Martin
So maybe it's worth thinking about making a general mechanism available for third parties to extend the type object without them all needing to have their own tp_flags bits and without needing to collude with each other to avoid conflicts.
That mechanism is already available. Subclass PyType_Type, and add whatever fields you want. Regards, Martin
Stefan Behnel wrote:
Dag Sverre Seljebotn, 16.05.2012 12:48:
On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.

I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).

In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
I think the use case hasn't been communicated all that clearly yet. Let's give it another try.
Imagine we have two sides, one that provides a callable and the other side that wants to call it. Both sides are implemented in C, so the callee has a C signature and the caller has the arguments available as C data types. The signature may or may not match the argument types exactly (float vs. double, int vs. long, ...), because the caller and the callee know nothing about each other initially, they just happen to appear in the same program at runtime. All they know is that they could call each other through Python space, but that would require data conversion, tuple packing, calling, tuple unpacking, data unpacking, and then potentially the same thing on the way back. They want to avoid that overhead.
Now, the caller needs to figure out if the callee has a compatible signature. The callee may provide more than one signature (i.e. more than one C call entry point), perhaps because it is implemented to deal with different input data types efficiently, or perhaps because it can efficiently convert them to its expected input. So, there is a signature on the caller side given by the argument types it holds, and a couple of signature on the callee side that can accept different C data input. Then the caller needs to find out which signatures there are and match them against what it can efficiently call. It may even be a JIT compiler that can generate an efficient call signature on the fly, given a suitable signature on callee side.
An example for this is an algorithm that evaluates a user provided function on a large NumPy array. The caller knows what array type it is operating on, and the user provided function may be designed to efficiently operate on arrays of int, float and double entries.
Given that use case, can I suggest the following:

Separate the discovery of the function from its use. By this I mean first lookup the function (outside of the loop) then use the function (inside the loop). It would then be possible to lookup the function pointer, using the standard API, PyObject_GetAttr (or maybe _PyType_Lookup). Then when it came to applying the function, the function pointer could be used directly.

To do this would require an extra builtin-function-like object, which would wrap the C function pointer. Currently the builtin (C) function type only supports a very limited range of types for the underlying function pointer. For example, an extended builtin-function could support (among other types) the C function type double (*func)(double, double). The extended builtin-function would be a Python callable, but would allow C extensions such as NumPy to access the underlying C function directly.

The builtin-function declaration would consist of a pointer to the underlying function pointer and a type declaration which states which types it accepts. The VM would be responsible for any unboxing/boxing required. E.g. float.__add__ could be constructed from a very simple C function (that adds two doubles and returns a double) and a type declaration: (cdouble, cdouble)->cdouble.

Allowable types would be intptr_t, double or PyObject* (and maybe char*). PyObject* types could be further qualified with their Python type. Not allowing char, short, unsigned etc may seem like it's too restrictive, but it prevents an explosion of possible types. Allowing only 3 C-level types and no more than 3 parameters (plus return) means that all 121 (3**4+3**3+3**2+3**1+3**0) permutations can be handled without resorting to ctypes/ffi.

Example usage:

    typedef double (*ddd_func)(double, double);
    ddd_func cfunc;
    PyObject *func = _PyType_Lookup(the_type, the_attribute);
    if (Py_TYPE(func) == &Py_ExtendedBuiltinFunction_Type &&
        strcmp(Py_ExtendedFunctionBuiltin_TypeOf(func), "d,d->d") == 0)
        cfunc = Py_ExtendedFunctionBuiltin_GetFunctionPtr(func);
    else
        goto feature_not_provided;
    for (;;)
        /* Loop using cfunc */

[snip]

Cheers, Mark.
On 05/16/2012 02:47 PM, Mark Shannon wrote:
Stefan Behnel wrote:
Dag Sverre Seljebotn, 16.05.2012 12:48:
On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.

I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).

In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
I think the use case hasn't been communicated all that clearly yet. Let's give it another try.
Imagine we have two sides, one that provides a callable and the other side that wants to call it. Both sides are implemented in C, so the callee has a C signature and the caller has the arguments available as C data types. The signature may or may not match the argument types exactly (float vs. double, int vs. long, ...), because the caller and the callee know nothing about each other initially, they just happen to appear in the same program at runtime. All they know is that they could call each other through Python space, but that would require data conversion, tuple packing, calling, tuple unpacking, data unpacking, and then potentially the same thing on the way back. They want to avoid that overhead.
Now, the caller needs to figure out if the callee has a compatible signature. The callee may provide more than one signature (i.e. more than one C call entry point), perhaps because it is implemented to deal with different input data types efficiently, or perhaps because it can efficiently convert them to its expected input. So, there is a signature on the caller side given by the argument types it holds, and a couple of signature on the callee side that can accept different C data input. Then the caller needs to find out which signatures there are and match them against what it can efficiently call. It may even be a JIT compiler that can generate an efficient call signature on the fly, given a suitable signature on callee side.
An example for this is an algorithm that evaluates a user provided function on a large NumPy array. The caller knows what array type it is operating on, and the user provided function may be designed to efficiently operate on arrays of int, float and double entries.
Given that use case, can I suggest the following:
Separate the discovery of the function from its use. By this I mean first lookup the function (outside of the loop) then use the function (inside the loop).
We would obviously do that when we can. But Cython is a compiler/code translator, and we don't control usecases. You can easily make up usecases (= Cython code people write) where you can't easily separate the two.

For instance, the Sage project has hundreds of thousands of lines of object-oriented Cython code (NOT just array-oriented, but also graphs and trees and stuff), which is all based on Cython's own fast vtable dispatches a la C++. They might want to clean up their code and use more generic callback objects in some places.

Other users currently pass around C pointers for callback functions, and we'd like to tell them "pass around these nicer Python callables instead, honestly, the penalty is only 2 ns per call". (*Regardless* of how you use them, like making sure you use them in a loop where we can statically pull out the function pointer acquisition. Saying "this is only non-sluggish if you do x, y, z" puts users off.)

I'm not asking you to consider the details of all that. Just to allow some kind of high-performance extensibility of PyTypeObject, so that we can *stop* bothering python-dev with specific requirements from our parallel universe of nearly-all-Cython-and-Fortran-and-C++ codebases :-)

Dag
Dag Sverre Seljebotn wrote:
On 05/16/2012 02:47 PM, Mark Shannon wrote:
Stefan Behnel wrote:
Dag Sverre Seljebotn, 16.05.2012 12:48:
On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.

I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).

In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
I think the use case hasn't been communicated all that clearly yet. Let's give it another try.
Imagine we have two sides, one that provides a callable and the other side that wants to call it. Both sides are implemented in C, so the callee has a C signature and the caller has the arguments available as C data types. The signature may or may not match the argument types exactly (float vs. double, int vs. long, ...), because the caller and the callee know nothing about each other initially, they just happen to appear in the same program at runtime. All they know is that they could call each other through Python space, but that would require data conversion, tuple packing, calling, tuple unpacking, data unpacking, and then potentially the same thing on the way back. They want to avoid that overhead.
Now, the caller needs to figure out if the callee has a compatible signature. The callee may provide more than one signature (i.e. more than one C call entry point), perhaps because it is implemented to deal with different input data types efficiently, or perhaps because it can efficiently convert them to its expected input. So, there is a signature on the caller side given by the argument types it holds, and a couple of signature on the callee side that can accept different C data input. Then the caller needs to find out which signatures there are and match them against what it can efficiently call. It may even be a JIT compiler that can generate an efficient call signature on the fly, given a suitable signature on callee side.
An example for this is an algorithm that evaluates a user provided function on a large NumPy array. The caller knows what array type it is operating on, and the user provided function may be designed to efficiently operate on arrays of int, float and double entries.
Given that use case, can I suggest the following:
Separate the discovery of the function from its use. By this I mean first lookup the function (outside of the loop) then use the function (inside the loop).
We would obviously do that when we can. But Cython is a compiler/code translator, and we don't control usecases. You can easily make up usecases (= Cython code people write) where you can't easily separate the two.
For instance, the Sage projects has hundreds of thousands of lines of object-oriented Cython code (NOT just array-oriented, but also graphs and trees and stuff), which is all based on Cython's own fast vtable dispatches a la C++. They might want to clean up their code and more generic callback objects some places.
Other users currently pass around C pointers for callback functions, and we'd like to tell them "pass around these nicer Python callables instead, honestly, the penalty is only 2 ns per call". (*Regardless* of how you use them, like making sure you use them in a loop where we can statically pull out the function pointer acquisition. Saying "this is only non-sluggish if you do x, y, z puts users off.)
Why not pass around a PyCFunction object instead of a C function pointer? It contains two fields: the function pointer and the object (self), which is exactly what you want. Of course, the PyCFunction object only allows a limited range of function types, which is why I am suggesting a variant which supports a wider range of C function pointer types. Is a single extra indirection, obj->func() rather than func(), really that inefficient? If you are passing around raw pointers, you have already given up on dynamic type checking.
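For reference, a small illustration of what a PyCFunction already exposes today through public macros; the limitation is that the wrapped pointer must have one of the fixed PyCFunction signatures, which is exactly what Mark's extended variant would lift.

    #include <Python.h>

    static int
    unpack_cfunction(PyObject *obj, PyCFunction *func_out, PyObject **self_out)
    {
        if (!PyCFunction_Check(obj))
            return 0;
        /* the underlying pointer has type PyObject *(*)(PyObject *, PyObject *) */
        *func_out = PyCFunction_GET_FUNCTION(obj);
        *self_out = PyCFunction_GET_SELF(obj);
        return 1;
    }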
I'm not asking you to consider the details of all that. Just to allow some kind of high-performance extensibility of PyTypeObject, so that we can *stop* bothering python-dev with specific requirements from our parallel universe of nearly-all-Cython-and-Fortran-and-C++ codebases :-)
If I read it correctly, you have two problems you wish to solve:

1. A fast callable that can be passed around (see above)
2. Fast access to that callable from a type.

The solution for 2 is the _PyType_Lookup() function. By the time you have fixed your proposed solution to properly handle subclassing I doubt it will be any quicker than _PyType_Lookup().

Cheers, Mark.
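A sketch of the _PyType_Lookup() route Mark mentions: resolve the attribute once through the type's MRO (a borrowed reference, no descriptor protocol, no instance dict), then reuse the result in the hot loop. _PyType_Lookup is a private CPython API, so this is illustration rather than a recommendation; the attribute name is made up.

    #include <Python.h>

    static PyObject *
    lookup_once(PyTypeObject *tp, const char *attr_name)
    {
        PyObject *name = PyUnicode_InternFromString(attr_name);
        PyObject *value;
        if (name == NULL)
            return NULL;
        value = _PyType_Lookup(tp, name);   /* borrowed reference or NULL */
        Py_XINCREF(value);
        Py_DECREF(name);
        return value;
    }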
On Wed, May 16, 2012 at 8:40 AM, Mark Shannon wrote:
Dag Sverre Seljebotn wrote:
On 05/16/2012 02:47 PM, Mark Shannon wrote:
Stefan Behnel wrote:
Dag Sverre Seljebotn, 16.05.2012 12:48:
On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:
Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.
I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).
In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
I think the use case hasn't been communicated all that clearly yet. Let's give it another try.
Imagine we have two sides, one that provides a callable and the other side that wants to call it. Both sides are implemented in C, so the callee has a C signature and the caller has the arguments available as C data types. The signature may or may not match the argument types exactly (float vs. double, int vs. long, ...), because the caller and the callee know nothing about each other initially, they just happen to appear in the same program at runtime. All they know is that they could call each other through Python space, but that would require data conversion, tuple packing, calling, tuple unpacking, data unpacking, and then potentially the same thing on the way back. They want to avoid that overhead.
Now, the caller needs to figure out if the callee has a compatible signature. The callee may provide more than one signature (i.e. more than one C call entry point), perhaps because it is implemented to deal with different input data types efficiently, or perhaps because it can efficiently convert them to its expected input. So, there is a signature on the caller side given by the argument types it holds, and a couple of signature on the callee side that can accept different C data input. Then the caller needs to find out which signatures there are and match them against what it can efficiently call. It may even be a JIT compiler that can generate an efficient call signature on the fly, given a suitable signature on callee side.
An example for this is an algorithm that evaluates a user provided function on a large NumPy array. The caller knows what array type it is operating on, and the user provided function may be designed to efficiently operate on arrays of int, float and double entries.
Given that use case, can I suggest the following:
Separate the discovery of the function from its use. By this I mean first lookup the function (outside of the loop) then use the function (inside the loop).
We would obviously do that when we can. But Cython is a compiler/code translator, and we don't control usecases. You can easily make up usecases (= Cython code people write) where you can't easily separate the two.
For instance, the Sage projects has hundreds of thousands of lines of object-oriented Cython code (NOT just array-oriented, but also graphs and trees and stuff), which is all based on Cython's own fast vtable dispatches a la C++. They might want to clean up their code and more generic callback objects some places.
Other users currently pass around C pointers for callback functions, and we'd like to tell them "pass around these nicer Python callables instead, honestly, the penalty is only 2 ns per call". (*Regardless* of how you use them, like making sure you use them in a loop where we can statically pull out the function pointer acquisition. Saying "this is only non-sluggish if you do x, y, z puts users off.)
Why not pass around a PyCFunction object, instead of a C function pointer. It contains two fields: the function pointer and the object (self), which is exactly what you want.
Of course, the PyCFunction object only allows a limited range of function types, which is why I am suggesting a variant which supports a wider range of C function pointer types.
Is a single extra indirection in obj->func() rather than func(), really that inefficient? If you are passing around raw pointers, you have already given up on dynamic type checking.
I'm not asking you to consider the details of all that. Just to allow some kind of high-performance extensibility of PyTypeObject, so that we can *stop* bothering python-dev with specific requirements from our parallel universe of nearly-all-Cython-and-Fortran-and-C++ codebases :-)
If I read it correctly, you have two problems you wish to solve: 1. A fast callable that can be passed around (see above) 2. Fast access to that callable from a type.
The solution for 2. is the _PyType_Lookup() function. By the time you have fixed your proposed solution to properly handle subclassing I doubt it will be any quicker than _PyType_Lookup().
It is certainly (2) that we are most interested in solving here; (1) can be solved in a variety of ways. For this second point, we're looking for something that's faster than a dictionary lookup. (For example, a common use case is user-provided functions operating on C doubles, which can be quite fast.)

The PyTypeObject struct is in large part a list of methods that were deemed too common and time-critical to merit the dictionary lookup (and Python call) overhead. Unfortunately, it's not extensible. We figured it'd be useful to get feedback from the larger Python community on how best to add extensibility, in particular with an eye for being future-proof and possibly an official part of the standard for some future version of Python.

- Robert
Robert Bradshaw wrote:
On Wed, May 16, 2012 at 8:40 AM, Mark Shannon wrote:
Dag Sverre Seljebotn wrote:
On 05/16/2012 02:47 PM, Mark Shannon wrote:
Stefan Behnel wrote:
Dag Sverre Seljebotn, 16.05.2012 12:48:
On 05/16/2012 11:50 AM, "Martin v. Löwis" wrote:

Agreed in general, but in this case, it's really not that easy. A C function call involves a certain overhead all by itself, so calling into the C-API multiple times may be substantially more costly than, say, calling through a function pointer once and then running over a returned C array comparing numbers. And definitely way more costly than running over an array that the type struct points to directly. We are not talking about hundreds of entries here, just a few. A linear scan in 64 bit steps over something like a hundred bytes in the L1 cache should hardly be measurable.

I give up, then. I fail to understand the problem. Apparently, you want to do something with the value you get from this lookup operation, but that something won't involve function calls (or else the function call overhead for the lookup wouldn't be relevant).

In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
I think the use case hasn't been communicated all that clearly yet. Let's give it another try.
Imagine we have two sides, one that provides a callable and the other side that wants to call it. Both sides are implemented in C, so the callee has a C signature and the caller has the arguments available as C data types. The signature may or may not match the argument types exactly (float vs. double, int vs. long, ...), because the caller and the callee know nothing about each other initially, they just happen to appear in the same program at runtime. All they know is that they could call each other through Python space, but that would require data conversion, tuple packing, calling, tuple unpacking, data unpacking, and then potentially the same thing on the way back. They want to avoid that overhead.
Now, the caller needs to figure out if the callee has a compatible signature. The callee may provide more than one signature (i.e. more than one C call entry point), perhaps because it is implemented to deal with different input data types efficiently, or perhaps because it can efficiently convert them to its expected input. So, there is a signature on the caller side given by the argument types it holds, and a couple of signatures on the callee side that can accept different C data input. Then the caller needs to find out which signatures there are and match them against what it can efficiently call. It may even be a JIT compiler that can generate an efficient call signature on the fly, given a suitable signature on callee side.
An example for this is an algorithm that evaluates a user provided function on a large NumPy array. The caller knows what array type it is operating on, and the user provided function may be designed to efficiently operate on arrays of int, float and double entries.
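To make that handshake concrete, a minimal sketch of the kind of table a callee might expose and how a caller could scan it; the struct name and the signature-string convention here are made up for illustration, not a proposed standard:

#include <stddef.h>
#include <string.h>

/* Hypothetical: one entry per C entry point the callee exposes. */
typedef struct {
    const char *signature;   /* e.g. "d->d" for double f(double) */
    void *funcptr;           /* entry point matching that signature */
} CSignatureEntry;

/* Caller side: pick the first entry point it knows how to call directly;
   NULL means "fall back to a normal Python-level call". */
static void *find_entry(const CSignatureEntry *entries, size_t n,
                        const char *wanted)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (strcmp(entries[i].signature, wanted) == 0)
            return entries[i].funcptr;
    }
    return NULL;
}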
Given that use case, can I suggest the following:
Separate the discovery of the function from its use. By this I mean first lookup the function (outside of the loop) then use the function (inside the loop).
We would obviously do that when we can. But Cython is a compiler/code translator, and we don't control usecases. You can easily make up usecases (= Cython code people write) where you can't easily separate the two.
For instance, the Sage project has hundreds of thousands of lines of object-oriented Cython code (NOT just array-oriented, but also graphs and trees and stuff), which is all based on Cython's own fast vtable dispatches a la C++. They might want to clean up their code and use more generic callback objects in some places.
Other users currently pass around C pointers for callback functions, and we'd like to tell them "pass around these nicer Python callables instead, honestly, the penalty is only 2 ns per call" -- *regardless* of how you use them, without having to make sure you use them in a loop where we can statically pull out the function pointer acquisition. Saying "this is only non-sluggish if you do x, y, z" puts users off.
Why not pass around a PyCFunction object instead of a C function pointer? It contains two fields: the function pointer and the object (self), which is exactly what you want.
Of course, the PyCFunction object only allows a limited range of function types, which is why I am suggesting a variant which supports a wider range of C function pointer types.
Is a single extra indirection, obj->func() rather than func(), really that inefficient? If you are passing around raw pointers, you have already given up on dynamic type checking.
I'm not asking you to consider the details of all that. Just to allow some kind of high-performance extensibility of PyTypeObject, so that we can *stop* bothering python-dev with specific requirements from our parallel universe of nearly-all-Cython-and-Fortran-and-C++ codebases :-)
If I read it correctly, you have two problems you wish to solve: 1. A fast callable that can be passed around (see above) 2. Fast access to that callable from a type.
The solution for 2. is the _PyType_Lookup() function. By the time you have fixed your proposed solution to properly handle subclassing I doubt it will be any quicker than _PyType_Lookup().
It is certainly (2) that we are most interested in solving here; (1) can be solved in a variety of ways. For this second point, we're looking for something that's faster than a dictionary lookup. (For example, a common use case is user-provided functions operating on C doubles, which can be quite fast.)
_PyType_Lookup() is fast; it doesn't perform any dictionary lookups if the (type, attribute) pair is in the cache.
The PyTypeObject struct is in large part a list of methods that were deemed too common and time-critical to merit the dictionary lookup (and Python call) overhead. Unfortunately, it's not extensible. We figured it'd be useful to get any feedback from the large Python community on how best to add extensibility, in particular with an eye for being future-proof and possibly an official part of the standard for some future version of Python.
I don't see any problem with making _PyType_Lookup() public. But others might disagree. Cheers, Mark.
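For reference, consuming _PyType_Lookup() would look roughly like the sketch below. It is a private API today, returns a borrowed reference, does not set an exception on a miss, and needs the GIL; the attribute name used here is invented:

#include <Python.h>

/* Sketch: look up a (hypothetical) per-type attribute through the
   method cache that _PyType_Lookup() maintains. */
static PyObject *lookup_c_signatures(PyObject *obj)
{
    static PyObject *name = NULL;   /* interned once, then reused */
    if (name == NULL) {
        name = PyUnicode_InternFromString("__c_signatures__");  /* made-up name */
        if (name == NULL)
            return NULL;
    }
    /* Borrowed reference; NULL (no exception set) when not found. */
    return _PyType_Lookup(Py_TYPE(obj), name);
}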
Does this use case make sense to everyone?
The reason why we are discussing this on python-dev is that we are looking for a general way to expose these C level signatures within the Python ecosystem. And Dag's idea was to expose them as part of the type object, basically as an addition to the current Python level tp_call() slot.
The use case makes sense, yet there is also a long-standing solution already to expose APIs and function pointers: the capsule objects. If you want to avoid dictionary lookups on the server side, implement tp_getattro, comparing addresses of interned strings. Regards, Martin
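A rough sketch of what that suggestion could look like on the providing type; the attribute name, capsule name and payload struct are invented for illustration:

#include <Python.h>

/* Invented payload handed out through a capsule. */
typedef struct {
    unsigned long long signature;   /* 64-bit signature tag */
    double (*func)(double);         /* the fast entry point */
} FastCallEntry;

static FastCallEntry the_entry;   /* filled in when the type is set up */
static PyObject *fastcall_name;   /* interned "_fastcall_entry_", created
                                     with PyUnicode_InternFromString() at
                                     module init time */

static PyObject *provider_getattro(PyObject *self, PyObject *name)
{
    /* Pointer comparison is enough if both sides interned the name. */
    if (name == fastcall_name)
        return PyCapsule_New(&the_entry, "example.FastCallEntry", NULL);
    /* Everything else goes through the normal machinery. */
    return PyObject_GenericGetAttr(self, name);
}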
On Wed, May 16, 2012 at 11:33 AM, "Martin v. Löwis"
Does this use case make sense to everyone?
The reason why we are discussing this on python-dev is that we are looking for a general way to expose these C level signatures within the Python ecosystem. And Dag's idea was to expose them as part of the type object, basically as an addition to the current Python level tp_call() slot.
The use case makes sense, yet there is also a long-standing solution already to expose APIs and function pointers: the capsule objects.
If you want to avoid dictionary lookups on the server side, implement tp_getattro, comparing addresses of interned strings.
Yes, that's an idea worth looking at. The point about implementing tp_getattro to avoid dictionary lookup overhead is a good one, worth trying at least. One drawback is that this approach does require the GIL (as does _PyType_Lookup). Regarding the C function being faster than the dictionary lookup (or at least close enough that the lookup takes time), yes, this happens all the time. For example one might be solving differential equations and the "user input" is essentially a set of (usually simple) double f(double) and its derivatives. - Robert
On 05/16/2012 10:24 PM, Robert Bradshaw wrote:
On Wed, May 16, 2012 at 11:33 AM, "Martin v. Löwis"
wrote: Does this use case make sense to everyone?
The reason why we are discussing this on python-dev is that we are looking for a general way to expose these C level signatures within the Python ecosystem. And Dag's idea was to expose them as part of the type object, basically as an addition to the current Python level tp_call() slot.
The use case makes sense, yet there is also a long-standing solution already to expose APIs and function pointers: the capsule objects.
If you want to avoid dictionary lookups on the server side, implement tp_getattro, comparing addresses of interned strings.
Yes, that's an idea worth looking at. The point about implementing tp_getattro to avoid dictionary lookup overhead is a good one, worth trying at least. One drawback is that this approach does require the GIL (as does _PyType_Lookup).
Regarding the C function being faster than the dictionary lookup (or at least close enough that the lookup takes time), yes, this happens all the time. For example one might be solving differential equations and the "user input" is essentially a set of (usually simple) double f(double) and its derivatives.
To underline how this is performance critical to us, perhaps a full Cython example is useful.

The following Cython code is a real world usecase. It is not too contrived in the essentials, although simplified a little bit. For instance undergrad engineering students could pick up Cython just to play with simple scalar functions like this.

from numpy import sin
# assume sin is a Python callable and that NumPy decides to support
# our spec to also support getting a "double (*sinfuncptr)(double)".

# Our mission: avoid having the user manually import "sin" from C,
# but allow just using the NumPy object and still be fast.

# define a function to integrate
cpdef double f(double x):
    return sin(x * x)    # guess on signature and use "fastcall"!

# the integrator
def integrate(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    for i in range(n):
        # This is also a fastcall, but can be cached so doesn't
        # matter...
        s += func(a + i * dx)
    return s * dx

integrate(f, 0, 1, 1000000)

There are two problems here:

- The "sin" global can be reassigned (monkey-patched) between each call to "f", no way for "f" to know. Even "sin" could do the reassignment. So you'd need to check for reassignment to do caching...

- The fastcall inside of "f" is separated from the loop in "integrate". And since "f" is often in another module, we can't rely on static full program analysis.

These problems with monkey-patching disappear if the lookup is negligible.

Some rough numbers:

- The overhead with the tp_flags hack is a 2 ns overhead (something similar with a metaclass, the problems are more how to synchronize that metaclass across multiple 3rd party libraries)

- Dict lookup 20 ns

- The sin function is about 35 ns. And, "f" is probably only 2-3 ns, and there could very easily be multiple such functions, defined in different modules, in a chain, in order to build up a formula.

Dag
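For readers who have not seen the CEP draft: the consumer side of the tp_flags idea would look roughly like the sketch below. The flag name and the zero-terminated array convention are made up here; only the general shape (a flag bit plus a pointer stored right after the type struct) follows the draft.

#include <Python.h>

#define OUR_TPFLAG_HAS_EXTENSIONS (1UL << 22)   /* illustrative bit only */

typedef struct {
    unsigned long extension_id;
    void *data;
} PyTypeObjectExtensionEntry;

/* If the flag is set, a pointer to the entry array is assumed to sit
   immediately after the PyTypeObject struct itself. */
static void *find_type_extension(PyTypeObject *tp, unsigned long wanted_id)
{
    PyTypeObjectExtensionEntry *entry;
    if (!(tp->tp_flags & OUR_TPFLAG_HAS_EXTENSIONS))
        return NULL;
    entry = *(PyTypeObjectExtensionEntry **)((char *)tp + sizeof(PyTypeObject));
    for (; entry->extension_id != 0; entry++) {
        if (entry->extension_id == wanted_id)
            return entry->data;   /* e.g. a table of C call signatures */
    }
    return NULL;
}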
Dag Sverre Seljebotn wrote:
On 05/16/2012 10:24 PM, Robert Bradshaw wrote:
On Wed, May 16, 2012 at 11:33 AM, "Martin v. Löwis"
wrote: Does this use case make sense to everyone?
The reason why we are discussing this on python-dev is that we are looking for a general way to expose these C level signatures within the Python ecosystem. And Dag's idea was to expose them as part of the type object, basically as an addition to the current Python level tp_call() slot.
The use case makes sense, yet there is also a long-standing solution already to expose APIs and function pointers: the capsule objects.
If you want to avoid dictionary lookups on the server side, implement tp_getattro, comparing addresses of interned strings.
Yes, that's an idea worth looking at. The point about implementing tp_getattro to avoid dictionary lookup overhead is a good one, worth trying at least. One drawback is that this approach does require the GIL (as does _PyType_Lookup).
Regarding the C function being faster than the dictionary lookup (or at least close enough that the lookup takes time), yes, this happens all the time. For example one might be solving differential equations and the "user input" is essentially a set of (usually simple) double f(double) and its derivatives.
To underline how this is performance critical to us, perhaps a full Cython example is useful.
The following Cython code is a real world usecase. It is not too contrived in the essentials, although simplified a little bit. For instance undergrad engineering students could pick up Cython just to play with simple scalar functions like this.
from numpy import sin
# assume sin is a Python callable and that NumPy decides to support
# our spec to also support getting a "double (*sinfuncptr)(double)".

# Our mission: avoid having the user manually import "sin" from C,
# but allow just using the NumPy object and still be fast.

# define a function to integrate
cpdef double f(double x):
    return sin(x * x)    # guess on signature and use "fastcall"!

# the integrator
def integrate(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    for i in range(n):
        # This is also a fastcall, but can be cached so doesn't
        # matter...
        s += func(a + i * dx)
    return s * dx

integrate(f, 0, 1, 1000000)
There are two problems here:
- The "sin" global can be reassigned (monkey-patched) between each call to "f", no way for "f" to know. Even "sin" could do the reassignment. So you'd need to check for reassignment to do caching...
Since Cython allows static typing why not just declare that func can treat sin as if it can't be monkeypatched? Moving the load of a global variable out of the loop does seem to be a rather obvious optimisation, if it were declared to be legal.
- The fastcall inside of "f" is separated from the loop in "integrate". And since "f" is often in another module, we can't rely on static full program analysis.
These problems with monkey-patching disappear if the lookup is negligible.
Some rough numbers:
- The overhead with the tp_flags hack is a 2 ns overhead (something similar with a metaclass, the problems are more how to synchronize that metaclass across multiple 3rd party libraries)
Does your approach handle subtyping properly?
- Dict lookup 20 ns
Did you time _PyType_Lookup() ?
- The sin function is about 35 ns. And, "f" is probably only 2-3 ns, and there could very easily be multiple such functions, defined in different modules, in a chain, in order to build up a formula.
Such micro timings are meaningless, because the working set often tends to fit in the hardware cache. A level 2 cache miss can take 100s of cycles. Cheers, Mark.
Mark Shannon, 17.05.2012 12:38:
Dag Sverre Seljebotn wrote:
On 05/16/2012 10:24 PM, Robert Bradshaw wrote:
On Wed, May 16, 2012 at 11:33 AM, "Martin v. Löwis"
wrote: Does this use case make sense to everyone?
The reason why we are discussing this on python-dev is that we are looking for a general way to expose these C level signatures within the Python ecosystem. And Dag's idea was to expose them as part of the type object, basically as an addition to the current Python level tp_call() slot.
The use case makes sense, yet there is also a long-standing solution already to expose APIs and function pointers: the capsule objects.
If you want to avoid dictionary lookups on the server side, implement tp_getattro, comparing addresses of interned strings.
Yes, that's an idea worth looking at. The point about implementing tp_getattro to avoid dictionary lookup overhead is a good one, worth trying at least. One drawback is that this approach does require the GIL (as does _PyType_Lookup).
Regarding the C function being faster than the dictionary lookup (or at least close enough that the lookup takes time), yes, this happens all the time. For example one might be solving differential equations and the "user input" is essentially a set of (usually simple) double f(double) and its derivatives.
To underline how this is performance critical to us, perhaps a full Cython example is useful.
The following Cython code is a real world usecase. It is not too contrived in the essentials, although simplified a little bit. For instance undergrad engineering students could pick up Cython just to play with simple scalar functions like this.
from numpy import sin
# assume sin is a Python callable and that NumPy decides to support
# our spec to also support getting a "double (*sinfuncptr)(double)".

# Our mission: avoid having the user manually import "sin" from C,
# but allow just using the NumPy object and still be fast.

# define a function to integrate
cpdef double f(double x):
    return sin(x * x)    # guess on signature and use "fastcall"!

# the integrator
def integrate(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    for i in range(n):
        # This is also a fastcall, but can be cached so doesn't
        # matter...
        s += func(a + i * dx)
    return s * dx

integrate(f, 0, 1, 1000000)
There are two problems here:
- The "sin" global can be reassigned (monkey-patched) between each call to "f", no way for "f" to know. Even "sin" could do the reassignment. So you'd need to check for reassignment to do caching...
Since Cython allows static typing why not just declare that func can treat sin as if it can't be monkeypatched?
You'd simply say

cdef object sin    # declare it as a C variable of type 'object'
from numpy import sin

That's also the one obvious way to do it in Cython.
Moving the load of a global variable out of the loop does seem to be a rather obvious optimisation, if it were declared to be legal.
My proposal was to simply extract any C function pointers at assignment time, i.e. at import time in the example above. Signature matching can then be done at the first call and the result can be cached as long as the object variable isn't changed. All of that is local to the module and can thus easily be controlled at code generation time. Stefan
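In generated C, that caching could be as simple as the sketch below; get_c_entry() is a stand-in for whatever discovery mechanism ends up being agreed on, and all names are invented:

#include <Python.h>

/* Stand-in for the real discovery mechanism (capsule, type extension, ...). */
static void *get_c_entry(PyObject *obj, const char *signature)
{
    (void)obj; (void)signature;
    return NULL;               /* pretend nothing matched */
}

/* Module-level cache, refreshed only when the module-level variable
   is re-bound to a different object. */
static PyObject *cached_obj = NULL;
static double (*cached_fn)(double) = NULL;

static double call_sin(PyObject *sin_obj, double x)
{
    if (sin_obj != cached_obj) {
        /* First call, or the variable was re-assigned: redo the match. */
        cached_obj = sin_obj;
        cached_fn = (double (*)(double))get_c_entry(sin_obj, "d->d");
    }
    if (cached_fn != NULL)
        return cached_fn(x);   /* fast path, no boxing */
    else {
        /* Slow path: ordinary Python call (error handling omitted). */
        PyObject *res = PyObject_CallFunction(sin_obj, "d", x);
        double r = res ? PyFloat_AsDouble(res) : -1.0;
        Py_XDECREF(res);
        return r;
    }
}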
Mark Shannon wrote:
Dag Sverre Seljebotn wrote:
from numpy import sin
# assume sin is a Python callable and that NumPy decides to support
# our spec to also support getting a "double (*sinfuncptr)(double)".

# Our mission: avoid having the user manually import "sin" from C,
# but allow just using the NumPy object and still be fast.

# define a function to integrate
cpdef double f(double x):
    return sin(x * x)    # guess on signature and use "fastcall"!

# the integrator
def integrate(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    for i in range(n):
        # This is also a fastcall, but can be cached so doesn't
        # matter...
        s += func(a + i * dx)
    return s * dx

integrate(f, 0, 1, 1000000)
There are two problems here:
- The "sin" global can be reassigned (monkey-patched) between each call to "f", no way for "f" to know. Even "sin" could do the reassignment. So you'd need to check for reassignment to do caching...
Since Cython allows static typing why not just declare that func can treat sin as if it can't be monkeypatched?
If you want to manually declare stuff, you can always use a C function pointer too...
Moving the load of a global variable out of the loop does seem to be a rather obvious optimisation, if it were declared to be legal.
In case you didn't notice, there were no global variable loads inside the loop... You can keep chasing this, but there are *always* cases where they don't (and you need to save the situation by manual typing). Anyway: We should really discuss Cython on the Cython list. If my motivating example wasn't good enough for you there's really nothing I can do.
Some rough numbers:
- The overhead with the tp_flags hack is a 2 ns overhead (something similar with a metaclass, the problems are more how to synchronize that metaclass across multiple 3rd party libraries)
Does your approach handle subtyping properly?
Not really.
- Dict lookup 20 ns
Did you time _PyType_Lookup() ?
No, didn't get around to it yet (and thanks for pointing it out). (Though the GIL requirement is an issue too for Cython.)
- The sin function is about 35 ns. And, "f" is probably only 2-3 ns, and there could very easily be multiple such functions, defined in different modules, in a chain, in order to build up a formula.
Such micro timings are meaningless, because the working set often tends to fit in the hardware cache. A level 2 cache miss can take 100s of cycles.
I find this sort of response arrogant -- do you know the details of every usecase for a programming language under the sun?

Many Cython users are scientists. And in scientific computing in particular you *really* have the whole range of problems and working sets. Honestly. In some codes you only really care about the speed of the disk controller. In other cases you can spend *many seconds* working almost only in L1 or perhaps L2 cache (for instance when integrating ordinary differential equations in a few variables, which is not entirely different in nature from the example I posted). (Then, those many seconds are replicated many million times for different parameters on a large cluster, and a 2x speedup translates directly into large amounts of saved money.)

Also, with numerical codes you block up the problem so that loads to L2 are amortized over sufficient FLOPs (when you can).

Every time Cython becomes able to do stuff more easily in this domain, people thank us that they didn't have to dig up Fortran but can stay closer to Python.

Sorry for going off on a rant. I find that people will give well-meant advice about performance, but that advice is just generalizing from computer programs in entirely different domains (web apps?), and sweeping generalizations have a way of giving the wrong answer.

Dag
On 05/17/2012 08:13 PM, Dag Sverre Seljebotn wrote:
Mark Shannon wrote:
Dag Sverre Seljebotn wrote:
from numpy import sin
# assume sin is a Python callable and that NumPy decides to support
# our spec to also support getting a "double (*sinfuncptr)(double)".

# Our mission: avoid having the user manually import "sin" from C,
# but allow just using the NumPy object and still be fast.

# define a function to integrate
cpdef double f(double x):
    return sin(x * x)    # guess on signature and use "fastcall"!

# the integrator
def integrate(func, double a, double b, int n):
    cdef double s = 0
    cdef double dx = (b - a) / n
    for i in range(n):
        # This is also a fastcall, but can be cached so doesn't
        # matter...
        s += func(a + i * dx)
    return s * dx

integrate(f, 0, 1, 1000000)
There are two problems here:
- The "sin" global can be reassigned (monkey-patched) between each call to "f", no way for "f" to know. Even "sin" could do the reassignment. So you'd need to check for reassignment to do caching...
Since Cython allows static typing why not just declare that func can treat sin as if it can't be monkeypatched?
If you want to manually declare stuff, you can always use a C function pointer too...
Moving the load of a global variable out of the loop does seem to be a rather obvious optimisation, if it were declared to be legal.
In case you didn't notice, there were no global variable loads inside the loop...
You can keep chasing this, but there are *always* cases where they don't (and you need to save the situation by manual typing).
Anyway: We should really discuss Cython on the Cython list. If my motivating example wasn't good enough for you there's really nothing I can do.
Some rough numbers:
- The overhead with the tp_flags hack is a 2 ns overhead (something similar with a metaclass, the problems are more how to synchronize that metaclass across multiple 3rd party libraries)
Does your approach handle subtyping properly?
Not really.
- Dict lookup 20 ns
Did you time _PyType_Lookup() ?
No, didn't get around to it yet (and thanks for pointing it out). (Though the GIL requirement is an issue too for Cython.)
- The sin function is about 35 ns. And, "f" is probably only 2-3 ns, and there could very easily be multiple such functions, defined in different modules, in a chain, in order to build up a formula.
Such micro timings are meaningless, because the working set often tends to fit in the hardware cache. A level 2 cache miss can take 100s of cycles.
I'm sorry; if my rant wasn't clear: Such micro-benchmarks do in fact mimic very closely what you'd do if you were, say, integrating an ordinary differential equation. You *do* have a tight loop like that, just hammering on floating point numbers. Making that specific usecase more convenient was actually the original usecase that spawned this discussion on the NumPy list over a month ago... Dag
I find this sort of response arrogant -- do you know the details of every usecase for a programming language under the sun?
Many Cython users are scientists. And in scientific computing in particular you *really* have the whole range of problems and working sets. Honestly. In some codes you only really care about the speed of the disk controller. In other cases you can spend *many seconds* working almost only in L1 or perhaps L2 cache (for instance when integrating ordinary differential equations in a few variables, which is not entirely different in nature from the example I posted). (Then, those many seconds are replicated many million times for different parameters on a large cluster, and a 2x speedup translates directly into large amounts of saved money.)
Also, with numerical codes you block up the problem so that loads to L2 are amortized over sufficient FLOPs (when you can).
Every time Cython becomes able to do stuff more easily in this domain, people thank us that they didn't have to dig up Fortran but can stay closer to Python.
Sorry for going off on a rant. I find that people will give well-meant advice about performance, but that advice is just generalizing from computer programs in entirely different domains (web apps?), and sweeping generalizations have a way of giving the wrong answer.
Dag
On Thu, 17 May 2012 20:13:41 +0200, Dag Sverre Seljebotn
Every time Cython becomes able to do stuff more easily in this domain, people thank us that they didn't have to dig up Fortran but can stay closer to Python.
Sorry for going off on a rant. I find that people will give well-meant advice about performance, but that advice is just generalizing from computer programs in entirely different domains (web apps?), and sweeping generalizations have a way of giving the wrong answer.
I don't have opinions on the specific topic under discussion, since I don't get involved in the C level stuff unless I have to, but I do have some small amount of background in scientific computing (many years ago). I just want to chime in to say that I think it benefits the whole Python community to extend welcoming arms to the scientific Python community and see what we can do to help them (without, of course, compromising Python). I think it is safe to assume that they do have significant experience with real applications where timings at this level of detail do matter. The scientific computing community is pretty much by definition pushing the limits of what's possible. --David
In our specific case the value would be an offset added to the PyObject*, and there we would find a pointer to a C function (together with a 64-bit signature), and calling that C function (after checking the 64 bit signature) is our final objective.
And what the C function does really is faster than the lookup through a dictionary? I find that hard to believe.
I still think this is out of scope for python-dev. If this is something you want to be able to do for Python 2.4 as well, then you don't need any change to Python - you can do whatever you come up with for all Python versions, no need to (or point in) changing Python 3.4 (say).
We can go ahead and hijack tp_flags bit 22 to make things work in existing versions. But what if Python 3.8 then starts using that bit for something else?
Use flag bit 23 in Python 3.8. You know at compile time what Python version you have.
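i.e. the generated code could simply switch on the version at compile time; the bit numbers below are placeholders, not an actual allocation:

#include <Python.h>

/* Placeholder bit numbers; the point is only that the choice can be
   made at compile time from PY_VERSION_HEX. */
#if PY_VERSION_HEX >= 0x03080000
#  define OUR_EXTENSION_TPFLAG (1UL << 23)
#else
#  define OUR_EXTENSION_TPFLAG (1UL << 22)
#endif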
As this is apparently only relevant to speed fanatics, too, I suggest that you check how fast PyPI works for you.
Did you mean PyPy?
Oops, yes - Freudian slip :-) Regards, Martin
Martin v. Löwis wrote:
And, we want this to somehow work with existing Python; we still support users on Python 2.4.
This makes the question out-of-scope for python-dev - we only discuss new versions of Python here. Old versions cannot be developed anymore (as they are released already).
typedef struct {
    unsigned long extension_id;
    void *data;
} PyTypeObjectExtensionEntry;
and then a type object can (somehow!) point to an array of these. The array is linearly scanned
It's unclear to me why you think that a linear scan is faster than a dictionary lookup. The contrary will be the case - the dictionary lookup (PyObject_GetAttr) will be much faster.
PyObject_GetAttr does a lot more than just a dictionary lookup. Perhaps making _PyType_Lookup() public might provide what is needed?
Just make sure to use interned strings, and to cache the interned strings.
Regards, Martin
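On the consumer side, caching the interned name for that last point might look roughly like this; the attribute name is invented:

#include <Python.h>

/* Cache the interned attribute name once and reuse it for every lookup;
   a provider's tp_getattro can then recognize it by pointer comparison. */
static PyObject *get_c_api_capsule(PyObject *obj)
{
    static PyObject *name = NULL;
    if (name == NULL) {
        name = PyUnicode_InternFromString("_c_api_");   /* made-up name */
        if (name == NULL)
            return NULL;
    }
    return PyObject_GetAttr(obj, name);   /* new reference, or NULL + exception */
}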
participants (9):
- "Martin v. Löwis"
- Dag Sverre Seljebotn
- Greg Ewing
- Mark Shannon
- martin@v.loewis.de
- Nick Coghlan
- R. David Murray
- Robert Bradshaw
- Stefan Behnel