[Python-Dev] Removing the block stack (was Re: PEP 343 and with)

Thu Oct 6 09:06:39 CEST 2005

At 10:09 PM 10/5/2005 -0700, Neal Norwitz wrote:
>I've also been thinking about avoiding tuple creation when calling
>python functions.  The change I have in mind would probably have to
>wait until p3k, but could yield some speed ups.
>
>Warning:  half baked idea follows.

Yeah, I've been baking that idea for a long time, and it's a bit more 
complex than you've suggested, due to generators, sys._getframe(), and 
tracebacks.

>My thoughts are to dynamically allocate the Python stack memory (e.g.,
>void *stack = malloc(128MB)).  Then all calls within each thread uses
>its own stack.  So things would be pushed onto the stack like they are
>currently, but we wouldn't need to do create a tuple to pass to a
>method, they could just be used directly.  Basically more closely
>simulate the way it currently works in hardware.

Actually, Python/ceval.c already skips creating a tuple when calling Python 
functions with a fixed number of arguments (caller and callee) and no cell 
vars (i.e., not a closure).  It copies them straight from the calling frame 
stack to the callee frame's stack.

>This would mean all the PyArg_ParseTuple()s would have to change.  It
>may be possible to fake it out, but I'm not sure it's worth it which
>is why it would be easier to do this for p3k.

Actually, I've been thinking that replacing the arg tuple with a PyObject* 
array would allow us to skip tuple creation when calling C functions, since 
you could just give the C functions a pointer to the arguments on the 
caller's stack.  That would let us get rid of most remaining tuple 
allocations.  I suppose we'd also need either an argcount parameter.  The 
old APIs taking tuples for calls could trivially convert the tuples to a 
array pointer and size, then call the new APIs.

Actually, we'd probably have to have a tp_arraycall slot or something, with 
the existing tp_call forwarding to tp_arraycall in most cases, but 
occasionally the reverse.  The tricky part is making sure you don't end up 
with cases where you call a tuple API that converts to an array that then 
turns it back into a tuple!

>The general idea is to allocate the stack in one big hunk and just
>walk up/down it as functions are called/returned.  This only means
>incrementing or decrementing pointers.  This should allow us to avoid
>a bunch of copying and tuple creation/destruction.  Frames would
>hopefully be the same size which would help.  Note that even though
>there is a free list for frames, there could still be
>PyObject_GC_Resize()s often (or unused memory).  WIth my idea,
>hopefully there would be better memory locality, which could speed
>things up.

Yeah, unfortunately for your idea, generators would have to copy off bits 
of the stack and then copy them back in, making generators slower.  If it 
weren't for that part, the idea would probably be a good one, as arguments, 
locals, cells, and the block and value stacks could all be handled that 
way, with the compiler treating all operations as base-pointer offsets, 
thereby eliminating lots of more-complex pointer management in ceval.c and 
frameobject.c.

Another possible fix for generators would be of course to give them their 
own stack arena, but then you have the problem of needing to copy overflows 
from one such stack to another - at which point you're basically back to 
having frames.

On the other hand, maybe the good part of this idea is just eliminating all 
the pointer fudging and having the compiler determine stack offsets.  Then, 
the frame object layout would just consist of a big hunk of stack space, 
laid out as a PyObject* array.

The main problem with this concept is that it would change the meaning of 
certain opcodes, since right now the offsets of free variables in opcodes 
start over the numbering, but this approach would add the number of locals 
to those offsets.

[Python-Dev] Removing the block stack (was Re: PEP 343 and __with__)

[Python-Dev] Removing the block stack (was Re: PEP 343 and with)