[Python-Dev] Eliminating the block stack (was Re: Store x Load x --> DupStore)

Sun Feb 20 22:34:50 CET 2005

At 03:56 PM 2/20/05 -0500, Phillip J. Eby wrote:
>At 07:00 PM 2/20/05 +0000, Michael Hudson wrote:
>>Eliminating the blockstack would be nice (esp. if it's enough to get
>>frames small enough that they get allocated by PyMalloc) but this
>>seemed to be tricky too (or at least Armin, Samuele and I spent a
>>cuple of hours yakking about it on IRC and didn't come up with a clear
>>approach).  Dynamically allocating the blockstack would be simpler,
>>and might acheive a similar win.  (This is all from memory, I haven't
>>thought about specifics in a while).

I think I have an idea how to do it in a (relatively) simple fashion; see 
if you can find a hole in it:

* Change the PyTryBlock struct to include an additional member, 'int 
b_prev', that refers to the previous block in a chain

* Change the compiler's emission of SETUP_* opcodes, so that instead of a 
PyTryBlock being added to the blockstack at interpretation time, it's added 
to the end of a 'co_blktree' block array at compile time, with its 'b_prev' 
pointing to the current "top" of the block stack.  Instead of the SETUP_* 
argument being the handler offset, have it be the index of the just-added 
blocktree entry.

* Replace f_blockstack and f_iblock with 'int f_iblktree', and change 
PyFrame_BlockSetup() to set this equal to the SETUP_* argument, and 
PyFrame_BlockPop() to use this as an index into the code's co_blktree to 
retrieve the needed values.  PyFrame_BlockPop() would then set f_iblktree 
equal to the "popped" block's 'b_prev' member, thus "popping" the block 
from this virtual stack.

(Note, by the way, that the blocktree could actually be created as a 
post-processing step of the current compilation process, by a loop that 
scans the bytecode and tracks the current stack and blockstack levels, and 
then replaces the SETUP_* opcodes' arguments.  This might be a simpler 
option than trying to change the compiler to do it along the way.)

Can anybody see any flaws in this concept?  As far as I can tell it just 
generates all possible block stack states at compile time, but doesn't 
change block semantics in the least, and it scarcely touches the eval 
loop.  It seems like it could drop the size of frames enough to let them 
use pymalloc instead of the OS malloc, at the cost of a 16 bytes per block 
increase in the size of code objects.  (And of course the necessary changes 
to 'marshal' and 'dis' as well as the compiler and eval loop.)

(More precisely, frames whose f_nlocals + f_stacksize is 40 or less, would 
be 256 bytes or less, and therefore pymalloc-able.  However, this should 
cover all but the most complex functions.)