[Python-Dev] Store x Load x --> DupStore

Phillip J. Eby pje at telecommunity.com
Sun Feb 20 21:22:00 CET 2005

At 06:38 PM 2/20/05 +0000, Michael Hudson wrote:
> >> It folds the two steps into a new opcode.  In the case of
> >> store_name/load_name, it saves one three byte instruction, a trip around
> >> the eval-loop, two stack mutations, an incref/decref pair, a dictionary
> >> lookup, and an error check (for the lookup).  While it acts like a dup
> >> followed by a store, it is implemented more simply as a store that
> >> doesn't pop the stack.  The transformation is broadly applicable and
> >> occurs thousands of times in the standard library and test suite.
>I'm still a little curious as to what code creates such opcodes...

A simple STORE+LOAD case:

 >>> dis.dis(compile("x=1; y=x*2","?","exec"))
   1           0 LOAD_CONST               0 (1)
               3 STORE_NAME               0 (x)
               6 LOAD_NAME                0 (x)
               9 LOAD_CONST               1 (2)
              12 BINARY_MULTIPLY
              13 STORE_NAME               1 (y)
              16 LOAD_CONST               2 (None)
              19 RETURN_VALUE

And a simple DUP+STORE case:

 >>> dis.dis(compile("x=y=1","?","exec"))
   1           0 LOAD_CONST               0 (1)
               3 DUP_TOP
               4 STORE_NAME               0 (x)
               7 STORE_NAME               1 (y)
              10 LOAD_CONST               1 (None)
              13 RETURN_VALUE
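
To make the quoted description ("a store that doesn't pop the stack") concrete, here is a toy stack machine, not CPython's eval loop. The opcode name DUP_STORE_NAME is a placeholder of mine for the proposed opcode; the other opcode names are taken from the disassemblies above. It is only a sketch of the idea under those assumptions.

```python
def run(program):
    """Execute a list of (opcode, arg) pairs on a toy stack machine.

    Returns the resulting namespace and the number of instructions
    dispatched (a stand-in for trips around the eval loop).
    """
    stack, names, steps = [], {}, 0
    for op, arg in program:
        steps += 1
        if op == "LOAD_CONST":
            stack.append(arg)
        elif op == "LOAD_NAME":
            stack.append(names[arg])       # dictionary lookup
        elif op == "STORE_NAME":
            names[arg] = stack.pop()       # store, then pop
        elif op == "DUP_TOP":
            stack.append(stack[-1])
        elif op == "DUP_STORE_NAME":       # hypothetical name for the new opcode
            names[arg] = stack[-1]         # store, but leave the value on the stack
        elif op == "BINARY_MULTIPLY":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError("unknown opcode: %r" % (op,))
    return names, steps


# x=1; y=x*2 as compiled above (minus the trailing return)
store_load = [
    ("LOAD_CONST", 1),
    ("STORE_NAME", "x"),
    ("LOAD_NAME", "x"),
    ("LOAD_CONST", 2),
    ("BINARY_MULTIPLY", None),
    ("STORE_NAME", "y"),
]

# The same program with STORE_NAME x / LOAD_NAME x folded into one opcode
folded = [
    ("LOAD_CONST", 1),
    ("DUP_STORE_NAME", "x"),
    ("LOAD_CONST", 2),
    ("BINARY_MULTIPLY", None),
    ("STORE_NAME", "y"),
]
```

Both programs leave x bound to 1 and y bound to 2; the folded one dispatches five instructions instead of six and never performs the lookup of x.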

Of course, I'm not sure how commonly this sort of code occurs in places 
where it makes a difference to anything.  Function call overhead continues 
to be Python's most damaging performance issue, because it makes it 
expensive to use abstraction.
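
On the question of how often the pattern occurs, one rough way to check a given piece of source is to count adjacent STORE/LOAD pairs that refer to the same name. The sketch below assumes a Python new enough to have dis.get_instructions (it is much newer than this message); the helper name count_store_load_pairs is made up for illustration. It only looks at the top-level code object, so nested functions are not counted, and it skips a LOAD that is a jump target, since that pair could not be folded.

```python
import dis

# STORE opcode -> the LOAD opcode that would undo the pop
PAIRS = {
    "STORE_NAME": "LOAD_NAME",
    "STORE_FAST": "LOAD_FAST",
    "STORE_GLOBAL": "LOAD_GLOBAL",
}


def count_store_load_pairs(source, filename="?"):
    """Count STORE x immediately followed by LOAD x in top-level code."""
    code = compile(source, filename, "exec")
    instructions = list(dis.get_instructions(code))
    count = 0
    for prev, cur in zip(instructions, instructions[1:]):
        if (PAIRS.get(prev.opname) == cur.opname
                and prev.argval == cur.argval
                and not cur.is_jump_target):
            count += 1
    return count
```

Calling count_store_load_pairs("x=1; y=x*2") finds the pair shown in the first disassembly; "x=y=1" has none, because that is the DUP+STORE form instead.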

Here's a thought.  Suppose we split frames into an "object" part and a 
"struct" part, with the object part being just a pointer to the struct 
part, and a flag indicating whether the struct part is stack-allocated or 
malloc'ed.  This would let us stack-allocate the bulk of the frame 
structure, but still have a frame "object" to pass around.  On exit from 
the C routine that stack-allocated the frame struct, we check to see if the 
frame object has a refcount>1, and if so, malloc a permanent home for the 
frame struct and update the frame object's struct pointer and flag.
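
As a way of pinning down the bookkeeping, here is a small model of that scheme written in Python rather than C. Everything in it is hypothetical: FrameStruct, FrameObject, run_call, and the explicit refcount field are illustration only and do not correspond to CPython's actual frame layout. Creating the struct stands in for the alloca, and copying it stands in for the malloc plus copy on exit.

```python
class FrameStruct:
    """Stands in for the bulk of the frame (locals, value stack, ...)."""

    def __init__(self, code_name, fastlocals=None):
        self.code_name = code_name
        self.fastlocals = list(fastlocals or [])


class FrameObject:
    """Small handle: a pointer to the struct plus a where-it-lives flag."""

    def __init__(self, struct):
        self.struct = struct
        self.stack_allocated = True
        self.refcount = 1          # the reference held by the running call


def run_call(code_name, body):
    """Model one call: allocate, run the body, promote on exit if needed."""
    struct = FrameStruct(code_name)            # models the alloca
    frame = FrameObject(struct)
    result = body(frame)
    if frame.refcount > 1:
        # Somebody kept the frame (a traceback, say): give the struct a
        # permanent home and repoint the object at it.
        frame.struct = FrameStruct(struct.code_name, struct.fastlocals)
        frame.stack_allocated = False
    frame.refcount -= 1                        # the call drops its reference
    return result, frame


def plain_body(frame):
    frame.struct.fastlocals.append(42)
    return "done"


kept = []


def leaking_body(frame):
    frame.struct.fastlocals.append(42)
    frame.refcount += 1                        # models an extra reference
    kept.append(frame)
    return "done"
```

In the common case (plain_body) the frame is never promoted and no heap allocation is modelled; only the leaking case pays for the copy, which is where the extra cost for exception handling would come from.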

In this way, frame allocation overhead could be reduced to the cost of an 
alloca, or just incorporated into the stack frame setup of the C routine 
itself, allowing the entire struct to be treated as "local variables" from 
a C perspective (which might benefit performance on architectures that 
reserve a register for local variable access).

Of course, this would slow down exception handling and other scenarios that 
result in extra references to a frame object, but if the OS malloc is the 
slow part of frame allocation (frame objects are too large for pymalloc), 
then perhaps it would be a net win.  On the other hand, this approach would 
definitely use more stack space per calling level.
