[Python-Dev] Removing the block stack (was Re: PEP 343 and __with__)
Phillip J. Eby
pje at telecommunity.com
Thu Oct 6 09:06:39 CEST 2005
At 10:09 PM 10/5/2005 -0700, Neal Norwitz wrote:
>I've also been thinking about avoiding tuple creation when calling
>python functions. The change I have in mind would probably have to
>wait until p3k, but could yield some speed ups.
>Warning: half baked idea follows.
Yeah, I've been baking that idea for a long time, and it's a bit more
complex than you've suggested, due to generators, sys._getframe(), and
>My thoughts are to dynamically allocate the Python stack memory (e.g.,
>void *stack = malloc(128MB)). Then all calls within each thread uses
>its own stack. So things would be pushed onto the stack like they are
>currently, but we wouldn't need to do create a tuple to pass to a
>method, they could just be used directly. Basically more closely
>simulate the way it currently works in hardware.
Actually, Python/ceval.c already skips creating a tuple when calling Python
functions with a fixed number of arguments (caller and callee) and no cell
vars (i.e., not a closure). It copies them straight from the calling frame
stack to the callee frame's stack.
>This would mean all the PyArg_ParseTuple()s would have to change. It
>may be possible to fake it out, but I'm not sure it's worth it which
>is why it would be easier to do this for p3k.
Actually, I've been thinking that replacing the arg tuple with a PyObject*
array would allow us to skip tuple creation when calling C functions, since
you could just give the C functions a pointer to the arguments on the
caller's stack. That would let us get rid of most remaining tuple
allocations. I suppose we'd also need either an argcount parameter. The
old APIs taking tuples for calls could trivially convert the tuples to a
array pointer and size, then call the new APIs.
Actually, we'd probably have to have a tp_arraycall slot or something, with
the existing tp_call forwarding to tp_arraycall in most cases, but
occasionally the reverse. The tricky part is making sure you don't end up
with cases where you call a tuple API that converts to an array that then
turns it back into a tuple!
>The general idea is to allocate the stack in one big hunk and just
>walk up/down it as functions are called/returned. This only means
>incrementing or decrementing pointers. This should allow us to avoid
>a bunch of copying and tuple creation/destruction. Frames would
>hopefully be the same size which would help. Note that even though
>there is a free list for frames, there could still be
>PyObject_GC_Resize()s often (or unused memory). WIth my idea,
>hopefully there would be better memory locality, which could speed
Yeah, unfortunately for your idea, generators would have to copy off bits
of the stack and then copy them back in, making generators slower. If it
weren't for that part, the idea would probably be a good one, as arguments,
locals, cells, and the block and value stacks could all be handled that
way, with the compiler treating all operations as base-pointer offsets,
thereby eliminating lots of more-complex pointer management in ceval.c and
Another possible fix for generators would be of course to give them their
own stack arena, but then you have the problem of needing to copy overflows
from one such stack to another - at which point you're basically back to
On the other hand, maybe the good part of this idea is just eliminating all
the pointer fudging and having the compiler determine stack offsets. Then,
the frame object layout would just consist of a big hunk of stack space,
laid out as a PyObject* array.
The main problem with this concept is that it would change the meaning of
certain opcodes, since right now the offsets of free variables in opcodes
start over the numbering, but this approach would add the number of locals
to those offsets.
More information about the Python-Dev