Idea for a fast calling convention
I've been working on ways to speed up parameter passing for function calls. It's somewhat expensive to pull parameters off of the stack, assemble them into a new tuple, have the function disassemble the tuple, and ultimately free the tuple.

METH_NOARGS and METH_O show nice speed-ups by skipping the tuple packing and unpacking. Since they are handled as special cases, that approach doesn't handle the general case with multiple or optional arguments.

My idea is for a new flag, METH_STACK, that generalizes (and potentially replaces) METH_NOARGS and METH_O. When CALL_FUNCTION sees the flag, it dispatches (*meth)(self, &stack, numargs). On the receiving end, the arguments get retrieved with analogues to PyArg_ParseTuple() and PyArg_UnpackTuple() which access the parameters directly and do not need a tuple as a carrier.

Raymond Hettinger
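[A rough picture of the proposed convention, in plain C rather than actual CPython code; Obj, stack_meth, unpack_stack2, and stack_max2 are all made-up names for illustration, not anything in the CPython API.]

```c
#include <stddef.h>

/* Stand-in for PyObject; everything here is a toy model. */
typedef struct { int value; } Obj;

/* A METH_STACK-style method: receives self, a pointer into the
 * interpreter's value stack, and the argument count.  No argument
 * tuple is ever built. */
typedef Obj *(*stack_meth)(Obj *self, Obj **stack, int nargs);

/* Analogue of PyArg_UnpackTuple(): checks the count and pulls the
 * arguments straight off the stack.  Returns 0 on arity mismatch. */
static int
unpack_stack2(Obj **stack, int nargs, Obj **a, Obj **b)
{
    if (nargs != 2)
        return 0;
    *a = stack[0];
    *b = stack[1];
    return 1;
}

/* Example receiver: returns the larger of its two arguments. */
static Obj *
stack_max2(Obj *self, Obj **stack, int nargs)
{
    Obj *a, *b;
    (void)self;
    if (!unpack_stack2(stack, nargs, &a, &b))
        return NULL;
    return a->value >= b->value ? a : b;
}

/* Tiny self-test: dispatch a 2-argument call through the stack
 * protocol, and check that a wrong arity is rejected. */
static int
demo_stack_call(void)
{
    Obj x = {3}, y = {7};
    Obj *stack[2] = {&x, &y};
    Obj *r = stack_max2(NULL, stack, 2);
    if (r == NULL || stack_max2(NULL, stack, 1) != NULL)
        return -1;
    return r->value;
}
```

The point is that the callee reads its arguments directly out of the caller's value stack; the only work left is the arity check that PyArg_UnpackTuple() would otherwise perform on the carrier tuple.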
"Raymond Hettinger" <raymond.hettinger@verizon.net> writes:
I've been working on ways to speed up parameter passing for function calls. It's somewhat expensive to pull parameters off of the stack, assemble them into a new tuple, have the function disassemble the tuple, and ultimately free the tuple.
Yahbut... that doesn't actually happen all that often.
METH_NOARGS and METH_O show nice speed-ups by skipping the tuple packing and unpacking. Since they are handled as special cases, that approach doesn't handle the general case with multiple or optional arguments.
Oh, you're talking about builtin functions.
My idea is for a new flag, METH_STACK, that generalizes (and potentially replaces) METH_NOARGS and METH_O.
When CALL_FUNCTION sees the flag, it dispatches (*meth)(self, &stack, numargs).
On the receiving end, the arguments get retrieved with analogues to PyArg_ParseTuple() and PyArg_UnpackTuple() which access the parameters directly and do not need a tuple as a carrier.
Have you seen my "function optimization reorganization" patch on SF? It's a somewhat different result of somewhat similar thinking. http://python.org/sf/876193 Cheers, mwh -- Java sucks. [...] Java on TV set top boxes will suck so hard it might well inhale people from off their sofa until their heads get wedged in the card slots. --- Jon Rabone, ucam.chat
On Friday 27 February 2004 03:20 am, Raymond Hettinger wrote:
I've been working on ways to speed up parameter passing for function calls. It's somewhat expensive to pull parameters off of the stack, assemble them into a new tuple, have the function disassemble the tuple, and ultimately free the tuple. ... My idea is for a new flag, METH_STACK, that generalizes (and potentially replaces) METH_NOARGS and METH_O.
How would METH_STACK methods work when called via func(*args) or via the various functions of the C API? The nice thing about METH_NOARGS and METH_O is that they describe what the called function needs. METH_STACK sounds like a core dump waiting to happen. ;-( -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> PythonLabs at Zope Corporation
At 04:49 PM 2/27/04 -0500, Fred L. Drake, Jr. wrote:
On Friday 27 February 2004 03:20 am, Raymond Hettinger wrote:
I've been working on ways to speed up parameter passing for function calls. It's somewhat expensive to pull parameters off of the stack, assemble them into a new tuple, have the function disassemble the tuple, and ultimately free the tuple. ... My idea is for a new flag, METH_STACK, that generalizes (and potentially replaces) METH_NOARGS and METH_O.
How would METH_STACK methods work when called via func(*args) or via the various functions of the C API?
The nice thing about METH_NOARGS and METH_O is that they describe what the called function needs. METH_STACK sounds like a core dump waiting to happen. ;-(
Maybe there should instead be a tp_call_stack slot. Then the various CALL opcodes would call that slot instead of tp_call. C API calls would still go through tp_call. 'object' would define tp_call_stack to call tp_call, using information from the stack. Python function and method objects would then override tp_call_stack to implement the shortcut form. In practice, though, I expect it would be faster to do as Jython and IronPython have done, and define a set of tp_call1, tp_call2, etc. slots that are optimized for specific calling situations, allowing C API calls to be sped up as well, provided you used things like PyObject_Call1(ob,arg), PyObject_Call2(ob,arg1,arg2), and so on. Perhaps there is some information that can be gleaned from the Jython research as to what are the most common number of positional parameters for calls.
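[The per-arity slot idea might look something like this plain-C sketch; TypeSlots, tp_call0/tp_call1/tp_call2, and call1 are hypothetical names modeled on the description above, not existing CPython slots.]

```c
#include <stddef.h>

typedef struct Obj Obj;

/* Hypothetical per-arity call slots, loosely after the Jython /
 * IronPython approach: one entry point per common argument count. */
typedef struct {
    Obj *(*tp_call0)(Obj *callable);
    Obj *(*tp_call1)(Obj *callable, Obj *arg);
    Obj *(*tp_call2)(Obj *callable, Obj *a, Obj *b);
} TypeSlots;

struct Obj {
    TypeSlots *type;
    int value;
};

/* The CALL opcodes (and PyObject_Call1 etc.) would pick the slot
 * matching the argument count, so no tuple is built for small
 * arities. */
static Obj *
call1(Obj *callable, Obj *arg)
{
    return callable->type->tp_call1(callable, arg);
}

/* Example callable type whose 1-arg entry just returns its argument. */
static Obj *
identity1(Obj *callable, Obj *arg)
{
    (void)callable;
    return arg;
}

static TypeSlots func_slots = { NULL, identity1, NULL };

static int
demo_call1(void)
{
    Obj f = { &func_slots, 0 };
    Obj a = { NULL, 42 };
    return call1(&f, &a)->value;
}
```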
"Phillip J. Eby" <pje@telecommunity.com> writes:
Maybe there should instead be a tp_call_stack slot. Then the various CALL opcodes would call that slot instead of tp_call. C API calls would still go through tp_call.
People *really* should look at the patch I mentioned...
In practice, though, I expect it would be faster to do as Jython and IronPython have done, and define a set of tp_call1, tp_call2, etc. slots that are optimized for specific calling situations, allowing C API calls to be sped up as well, provided you used things like PyObject_Call1(ob,arg), PyObject_Call2(ob,arg1,arg2), and so on.
I think this only really helps when you have a JIT compiler of some sort?
Perhaps there is some information that can be gleaned from the Jython research as to what are the most common number of positional parameters for calls.
That's easy: 0 then 1 then 2 then 3 then insignificant. Only a guess, but one I'm fairly confident of. Cheers, mwh -- ARTHUR: Don't ask me how it works or I'll start to whimper. -- The Hitch-Hikers Guide to the Galaxy, Episode 11
At 12:39 PM 2/28/04 +0000, Michael Hudson wrote:
"Phillip J. Eby" <pje@telecommunity.com> writes:
Maybe there should instead be a tp_call_stack slot. Then the various CALL opcodes would call that slot instead of tp_call. C API calls would still go through tp_call.
People *really* should look at the patch I mentioned...
Ah, I see you've implemented this already. :)
In practice, though, I expect it would be faster to do as Jython and IronPython have done, and define a set of tp_call1, tp_call2, etc. slots that are optimized for specific calling situations, allowing C API calls to be sped up as well, provided you used things like PyObject_Call1(ob,arg), PyObject_Call2(ob,arg1,arg2), and so on.
I think this only really helps when you have a JIT compiler of some sort?
No, all you need is a mechanism similar to the one in your patch, where the creation of a method or function object includes the selecting of a function pointer for each of the call signatures. So, when creating these objects, you'd select a call0 pointer, a call1 pointer, and so on. If the function requires more arguments than provided for that slot, you just point it to an "insufficient args" version of the function. If the function has defaults that would be filled in, you point it to a version that fills in the defaults, and so on. So, there's not a JIT required, but making the patch could be tedious, repetitive and error prone. Macros could possibly help a bit with the repetition and tedium, although not necessarily the error-prone part. :)
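[A minimal sketch of that creation-time selection, with made-up names (FuncObj, make_add2) and plain ints standing in for PyObject*: the function object carries one entry point per arity, and the "wrong" arities get pointed at adapter versions, here one that fills in a default.]

```c
typedef struct FuncObj FuncObj;

struct FuncObj {
    int (*call1)(FuncObj *f, int a);        /* entry for 1-arg calls */
    int (*call2)(FuncObj *f, int a, int b); /* entry for 2-arg calls */
    int default_b;                          /* default for parameter b */
};

/* The real implementation takes two arguments. */
static int
add2(FuncObj *f, int a, int b)
{
    (void)f;
    return a + b;
}

/* 1-arg entry, selected at creation time: fills in the default for b
 * before delegating, so the call site needs no arity logic at all. */
static int
add2_with_default(FuncObj *f, int a)
{
    return add2(f, a, f->default_b);
}

/* "Creation" picks the pointers once; every later call goes straight
 * through the matching slot. */
static FuncObj
make_add2(int default_b)
{
    FuncObj f = { add2_with_default, add2, default_b };
    return f;
}

static int
demo_defaults(void)
{
    FuncObj f = make_add2(10);
    return f.call1(&f, 5) + f.call2(&f, 1, 2);  /* (5+10) + (1+2) */
}
```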
Perhaps there is some information that can be gleaned from the Jython research as to what are the most common number of positional parameters for calls.
That's easy: 0 then 1 then 2 then 3 then insignificant.
Actually, I'd find that a bit surprising, as I would expect 1-arg calls to be more popular than 0-argument calls. But I guess if you look at a method call, it could be a 0-argument call to the method, that maps to a 1-argument call on the function. But, I guess that would at any rate put 0-argument calls in the running.

Anyway, I guess a 0-argument call for an instancemethod would look something like:

PyObject *
method_call0(PyObject *self)
{
    if (self->im_self) {
        return PyObject_Call1(self->im_func, self->im_self);
    }
    else {
        return PyObject_Call0(self->im_func);
    }
}

And then you'd repeat this pattern for call1 and call2, but call3 would have to fall back to creating an argument tuple if we're only going up to call3 slots.

Interestingly, the patterns involved in making these functions are basically left currying (methods) and right currying (default args). It may be that there's some way to generalize this, perhaps by using wrapping objects. On the other hand, that would tend to go against the intended speedup.
ARTHUR: Don't ask me how it works or I'll start to whimper. -- The Hitch-Hikers Guide to the Galaxy, Episode 11
Heh. Did you pick this quote intentionally? :)
"Phillip J. Eby" <pje@telecommunity.com> writes:
At 12:39 PM 2/28/04 +0000, Michael Hudson wrote:
"Phillip J. Eby" <pje@telecommunity.com> writes:
Maybe there should instead be a tp_call_stack slot. Then the various CALL opcodes would call that slot instead of tp_call. C API calls would still go through tp_call.
People *really* should look at the patch I mentioned...
Ah, I see you've implemented this already. :)
:-)
In practice, though, I expect it would be faster to do as Jython and IronPython have done, and define a set of tp_call1, tp_call2, etc. slots that are optimized for specific calling situations, allowing C API calls to be sped up as well, provided you used things like PyObject_Call1(ob,arg), PyObject_Call2(ob,arg1,arg2), and so on.
I think this only really helps when you have a JIT compiler of some sort?
No, all you need is a mechanism similar to the one in your patch, where the creation of a method or function object includes the selecting of a function pointer for each of the call signatures. So, when creating these objects, you'd select a call0 pointer, a call1 pointer, and so on. If the function requires more arguments than provided for that slot, you just point it to an "insufficient args" version of the function. If the function has defaults that would be filled in, you point it to a version that fills in the defaults, and so on.
Well, this is psyco-style code explosion by hand :-) Doesn't sound like fun. Incidentally, I think it would be goodness if PyCFunctions exposed more info about the arguments they take.
So, there's not a JIT required, but making the patch could be tedious, repetitive and error prone. Macros could possibly help a bit with the repetition and tedium, although not necessarily the error-prone part. :)
It probably wouldn't be seriously hard to write a program writing program for this.
Perhaps there is some information that can be gleaned from the Jython research as to what are the most common number of positional parameters for calls.
That's easy: 0 then 1 then 2 then 3 then insignificant.
Actually, I'd find that a bit surprising, as I would expect 1-arg calls to be more popular than 0-argument calls. But I guess if you look at a method call, it could be a 0-argument call to the method, that maps to a 1-argument call on the function. But, I guess that would at any rate put 0-argument calls in the running.
Well, it was just a guess. Maybe METH_O is more common.
Anyway, I guess a 0-argument call for an instancemethod would look something like:
Actually, this isn't much like how things work now.
PyObject *
method_call0(PyObject *self)
{
    if (self->im_self) {
        return PyObject_Call1(self->im_func, self->im_self);
    }
    else {
        return PyObject_Call0(self->im_func);
    }
}
but, yes.
And then you'd repeat this pattern for call1 and call2, but call3 would have to fall back to creating an argument tuple if we're only going up to call3 slots.
Interestingly, the patterns involved in making these functions are basically left currying (methods) and right currying (default args). It may be that there's some way to generalize this, perhaps by using wrapping objects. On the other hand, that would tend to go against the intended speedup.
I don't think this approach is going to yield mega speedups whatever. We need Pysco or similar tech for that, I think.
ARTHUR: Don't ask me how it works or I'll start to whimper. -- The Hitch-Hikers Guide to the Galaxy, Episode 11
Heh. Did you pick this quote intentionally? :)
No, but that quote applies to a worrying fraction of my posts... Cheers, mwh -- Or here's an even simpler indicator of how much C++ sucks: Print out the C++ Public Review Document. Have someone hold it about three feet above your head and then drop it. Thus you will be enlightened. -- Thant Tessman
Michael Hudson wrote: ...
Well, this is psyco-style code explosion by hand :-) Doesn't sound like fun.
Well, why not write a little script that generates the code, as part of the build process? And maybe making it an option, for those who fear code bloat?
Incidentally, I think it would be goodness if PyCFunctions exposed more info about the arguments they take.
So, there's not a JIT required, but making the patch could be tedious, repetitive and error prone. Macros could possibly help a bit with the repetition and tedium, although not necessarily the error-prone part. :)
It probably wouldn't be seriously hard to write a program writing program for this.
Ah, I see, my message is redundant. I'd like to encourage this. Sending it anyway; it's too bad about all the fine characters.

-- chris

p.s.: I believe some automatic source analysis and rewrite might pay off in other areas as well. Grepping through the sources, there are still very many similar patterns of PyArg_ParseTupleXXX calls, which could be replaced by less general, optimized versions. This would even *not* cause code bloat, since all those calling sequences would be smaller than now.

--
Christian Tismer             :^)   <mailto:tismer@stackless.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  mobile +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
     whom do you want to sponsor today?   http://www.stackless.com/
Christian Tismer <tismer@stackless.com> writes:
p.s.: I believe some automatic source analysis and rewrite might pay off in other areas as well. Grepping through the sources, there are still very many similar patterns of PyArg_ParseTupleXXX calls, which could be replaced by less general, optimized versions. This would even *not* cause code bloat, since all those calling sequences would be smaller than now.
Well, yes. C sucks seriously for things like this, though. It's frankly embarrassing that *every* time, say, ''.split() is called, some silly string is being parsed. Unclear what to do about this (except PyPy, I guess).

Cheers,
mwh

--
I have a feeling that any simple problem can be made arbitrarily
difficult by imposing a suitably heavy administrative process around
the development.  -- Joe Armstrong, comp.lang.functional
On Mar 1, 2004, at 5:52 AM, Michael Hudson wrote:
Christian Tismer <tismer@stackless.com> writes:
p.s.: I believe some automatic source analysis and rewrite might pay off in other areas as well. Grepping through the sources, there are still very many similar patterns of PyArg_ParseTupleXXX calls, which could be replaced by less general, optimized versions. This would even *not* cause code bloat, since all those calling sequences would be smaller than now.
Well, yes. C sucks seriously for things like this, though. It's frankly embarrassing that *every* time, say, ''.split() is called, some silly string is being parsed. Unclear what to do about this (except PyPy, I guess).
Surely there are other reasonable options. For example, we could start using something like Pyrex that could be modified to generate whatever gnarly C code needs to happen for optimal runtime performance with minimal input ugliness :) -bob
Bob Ippolito <bob@redivi.com> writes:
On Mar 1, 2004, at 5:52 AM, Michael Hudson wrote:
Well, yes. C sucks seriously for things like this, though. It's frankly embarrassing that *every* time, say, ''.split() is called, some silly string is being parsed. Unclear what to do about this (except PyPy, I guess).
Surely there's other reasonable options. For example, we could start using something like Pyrex that could be modified to generate whatever gnarly C code needs to happen for optimal runtime performance with minimal input ugliness :)
Hmm, yes, I hadn't thought of Pyrex. It also hadn't occurred to me that Pyrex might be desirable because the result might be more efficient, but I guess that's not so surprising.

Cheers,
mwh

--
Guido (like us!) is a bit schizophrenic here: he wants to be a
benevolent dictator, but also wants to treat people like grownups.
This probably worked better before Python got a large American
audience <0.9 wink>.                    -- Tim Peters, 10 Feb 2000
Michael Hudson wrote:
Christian Tismer <tismer@stackless.com> writes:
p.s.: I believe some automatic source analysis and rewrite might pay off in other areas as well. Grepping through the sources, there are still very many similar patterns of PyArg_ParseTupleXXX calls, which could be replaced by less general, optimized versions. This would even *not* cause code bloat, since all those calling sequences would be smaller than now.
Well, yes. C sucks seriously for things like this, though. It's frankly embarrassing that *every* time, say, ''.split() is called, some silly string is being parsed. Unclear what to do about this (except PyPy, I guess).
Why not a switch over the arg tuple size? Or do you mean this happens in so many different places that it would be tedious to change by hand?

Sure, we shouldn't try to morph Python into PyPy in the first place. But maybe with some runtime analysis using modified versions of the PyArg_Parse... functions, we could see how often each format string appears, and how often each optional path is used, as a hint for hand optimisation.

I could do that with Stackless, btw. For Win32, I have a few introspection tools which can do C stack analysis from a debug build and find out what is being called. They were meant for later C stack synthesis, which never worked, but this would give them some use again.

ciao - chris
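[The runtime-counting idea could be prototyped with something as simple as this; record_format and count_for are hypothetical instrumentation, imagined as hooks inside a wrapped PyArg_ParseTuple(), not real API.]

```c
#include <string.h>

/* Fixed-size histogram of PyArg_Parse-style format strings seen at
 * runtime.  Toy instrumentation, not production code. */
#define MAX_FORMATS 64

static struct { const char *fmt; int count; } format_stats[MAX_FORMATS];
static int nformats;

/* Would be called on every argument-parsing call with its format. */
static void
record_format(const char *fmt)
{
    int i;
    for (i = 0; i < nformats; i++) {
        if (strcmp(format_stats[i].fmt, fmt) == 0) {
            format_stats[i].count++;
            return;
        }
    }
    if (nformats < MAX_FORMATS) {
        format_stats[nformats].fmt = fmt;
        format_stats[nformats].count = 1;
        nformats++;
    }
}

static int
count_for(const char *fmt)
{
    int i;
    for (i = 0; i < nformats; i++)
        if (strcmp(format_stats[i].fmt, fmt) == 0)
            return format_stats[i].count;
    return 0;
}

/* Simulate a few parses and read back the tallies. */
static int
demo_stats(void)
{
    record_format("O|O:get");
    record_format("s:split");
    record_format("O|O:get");
    return count_for("O|O:get") * 10 + count_for("s:split");
}
```

The hot format strings found this way would be the candidates for hand-written, specialized argument fetchers.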
Michael Hudson wrote:
...
Well, yes. C sucks seriously for things like this, though. It's frankly embarrassing that *every* time, say, ''.split() is called, some silly string is being parsed. Unclear what to do about this (except PyPy, I guess).
Or Pyrex. Obviously either PyPy or Pyrex takes quite a bit of code rewriting. Fun rewriting, but rewriting nevertheless.

If we're going to require rewriting, isn't there a more short-term way to simply eliminate the requirement to use PyArg_ParseTuple in most cases? If functions declared their type signatures in a new-fangled PyMethodDef, then Python could optimize away common cases at module-initialization or type-initialization time. I don't understand why we usually (as opposed to occasionally) want to declare our type signature only when the function is run, rather than earlier when the runtime could do useful stuff with it.

Putting aside performance for a second: introspection on C functions is pretty weak for the same reason.

Paul Prescod
"Phillip J. Eby" <pje@telecommunity.com> writes:
Perhaps there is some information that can be gleaned from the Jython research as to what are the most common number of positional parameters for calls.
Queinnec, "Lisp in Small Pieces", p. 239:

arity          0   1   2   3   4   5   6   7   8
frequency(%)  35  30  18   9   4   1   0   0   0
cumulative(%) 35  66  84  93  97  99  99  99 100

at least for the 1,988 scheme functions in the book. He thinks 0 arity is over-represented due to some thunking techniques that he employs.

-- KBK
My idea is for a new flag, METH_STACK, that generalizes (and potentially replaces) METH_NOARGS and METH_O.
How would METH_STACK methods work when called via func(*args)
The args tuple is converted to an equivalent stack reference.
or via the various functions of the C API?
There would be an analogue to Py_BuildValue().

The nice thing about METH_NOARGS and METH_O is that they describe what the called function needs.

This is closer to METH_VARARGS, which lets the function itself handle the arguments.
METH_STACK sounds like a core dump waiting to happen.
It is no more or less a risk of a core dump than METH_VARARGS. Pretty much, if any C function expects one thing and gets another (passing a pointer to the wrong structure or some such), then something bad happens.

Right now, if you write f(x,y,z) where f is a PyCFunction, then ceval.c will execute CALL_FUNCTION 3 by pulling off three stack variables, loading them into a new tuple, and passing the tuple to the function, and the function will call either PyArg_ParseTuple() or PyArg_UnpackTuple(). With the new way, CALL_FUNCTION 3 calls the method with a stack pointer and the argument count (essentially the same data as the tuple), and the function calls PyArg_ParseStack() or PyArg_UnpackStack(), which verifies the number of arguments and pulls the data off of the stack. All of the steps are encapsulated.

Essentially, the only change is transferring the responsibility for pulling data off of the stack from ceval.c to the function that actually uses the data. This saves the unnecessary tuple carrier.

Raymond
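[The two code paths Raymond describes can be contrasted in a plain-C sketch; Tuple, call_via_tuple, call_via_stack, and the sum_* callees are all illustrative stand-ins, not CPython source.]

```c
#include <stdlib.h>

typedef struct { int value; } Obj;

typedef struct { int size; Obj *items[8]; } Tuple;

/* Old path: ceval packs the stack slots into a heap-allocated tuple,
 * the callee unpacks it, and the tuple is freed afterwards. */
static Obj *
call_via_tuple(Obj *(*meth)(Obj *self, Tuple *args), Obj *self,
               Obj **stack, int nargs)
{
    Tuple *t = malloc(sizeof(Tuple));
    Obj *result;
    int i;
    t->size = nargs;
    for (i = 0; i < nargs; i++)
        t->items[i] = stack[i];
    result = meth(self, t);  /* callee calls a ParseTuple analogue */
    free(t);                 /* the carrier tuple was pure overhead */
    return result;
}

/* New path: ceval hands the callee the stack pointer and count; the
 * callee pulls arguments off directly.  No allocation at all. */
static Obj *
call_via_stack(Obj *(*meth)(Obj *self, Obj **stack, int nargs),
               Obj *self, Obj **stack, int nargs)
{
    return meth(self, stack, nargs);
}

/* Two versions of the same callee, one per convention. */
static Obj *
sum_tuple(Obj *self, Tuple *args)
{
    static Obj r; int i;
    (void)self;
    r.value = 0;
    for (i = 0; i < args->size; i++)
        r.value += args->items[i]->value;
    return &r;
}

static Obj *
sum_stack(Obj *self, Obj **stack, int nargs)
{
    static Obj r; int i;
    (void)self;
    r.value = 0;
    for (i = 0; i < nargs; i++)
        r.value += stack[i]->value;
    return &r;
}

/* Both paths compute the same result; only the carrier differs. */
static int
demo_paths(void)
{
    Obj a = {1}, b = {2}, c = {3};
    Obj *stack[3] = {&a, &b, &c};
    return call_via_tuple(sum_tuple, NULL, stack, 3)->value
         + call_via_stack(sum_stack, NULL, stack, 3)->value;
}
```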
On Fri, 2004-02-27 at 03:20, Raymond Hettinger wrote:
My idea is for a new flag, METH_STACK, that generalizes (and potentially replaces) METH_NOARGS and METH_O.
When CALL_FUNCTION sees the flag, it dispatches (*meth)(self, &stack, numargs).
You're talking specifically about calls to C functions, right? The fast_function() path essentially does this for functions coded in Python.
On the receiving end, the arguments get retrieved with analogues to PyArg_ParseTuple() and PyArg_UnpackTuple() which access the parameters directly and do not need a tuple as a carrier.
I like Chris's suggestion to get rid of ParseTuple() where it's still needed. The primary motivation for METH_O, IIRC, was to avoid the overhead of calling PyArg_ParseTuple(); the ability to eliminate the tuple overhead was gravy.

There's one idea I had related to calling Python functions; not sure if it makes much sense. We push arguments onto the caller's stack, then copy them into the callee's frame. I wonder if there's a way to pre-allocate the callee's frame and "push" the arguments into that frame directly. It may not make much sense, because copying the arguments from stack to frame probably isn't a big expense.

Jeremy
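[Jeremy's idea, sketched with a toy Frame (hypothetical layout, plain ints standing in for locals): the current shape copies arguments from the caller's stack into the frame, while the proposed shape writes each evaluated argument into the pre-allocated frame directly.]

```c
typedef struct { int locals[8]; int nlocals; } Frame;

/* Current shape: args land on the caller's stack, then get copied
 * into the callee frame's locals when the frame is set up. */
static void
setup_frame_copy(Frame *callee, const int *caller_stack, int nargs)
{
    int i;
    for (i = 0; i < nargs; i++)
        callee->locals[i] = caller_stack[i];
    callee->nlocals = nargs;
}

/* Proposed shape: the caller evaluates each argument straight into
 * the pre-allocated callee frame, skipping the intermediate stack. */
static void
push_arg_direct(Frame *callee, int value)
{
    callee->locals[callee->nlocals++] = value;
}

/* Both shapes leave the callee frame in the same state. */
static int
demo_frames(void)
{
    Frame f1 = {{0}, 0}, f2 = {{0}, 0};
    int stack[3] = {4, 5, 6};
    setup_frame_copy(&f1, stack, 3);
    push_arg_direct(&f2, 4);
    push_arg_direct(&f2, 5);
    push_arg_direct(&f2, 6);
    return f1.locals[2] + f2.locals[2];
}
```

As Jeremy notes, the saved copy is small; the sketch only makes the equivalence of the two shapes concrete.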
participants (9)

- Bob Ippolito
- Christian Tismer
- Fred L. Drake, Jr.
- Jeremy Hylton
- kbk@shore.net
- Michael Hudson
- Paul Prescod
- Phillip J. Eby
- Raymond Hettinger