[Python-Dev] Store x Load x --> DupStore
Phillip J. Eby
pje at telecommunity.com
Sun Feb 20 21:22:00 CET 2005
At 06:38 PM 2/20/05 +0000, Michael Hudson wrote:
> >> It folds the two steps into a new opcode. In the case of
> >> store_name/load_name, it saves one three byte instruction, a trip around
> >> the eval-loop, two stack mutations, an incref/decref pair, a dictionary
> >> lookup, and an error check (for the lookup). While it acts like a dup
> >> followed by a store, it is implemented more simply as a store that
> >> doesn't pop the stack. The transformation is broadly applicable and
> >> occurs thousands of times in the standard library and test suite.
>I'm still a little curious as to what code creates such opcodes...
A simple STORE+LOAD case:
>>> dis.dis(compile("x=1; y=x*2","?","exec"))
  1           0 LOAD_CONST               0 (1)
              3 STORE_NAME               0 (x)
              6 LOAD_NAME                0 (x)
              9 LOAD_CONST               1 (2)
             12 BINARY_MULTIPLY
             13 STORE_NAME               1 (y)
             16 LOAD_CONST               2 (None)
             19 RETURN_VALUE
And a simple DUP+STORE case:
>>> dis.dis(compile("x=y=1","?","exec"))
  1           0 LOAD_CONST               0 (1)
              3 DUP_TOP
              4 STORE_NAME               0 (x)
              7 STORE_NAME               1 (y)
             10 LOAD_CONST               1 (None)
             13 RETURN_VALUE
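For a rough idea of how often the STORE+LOAD pattern turns up in a given piece of code, something like the following sketch could be used. It is not the proposed peephole pass, only a counter: the function name `count_store_load_pairs` is made up for illustration, it looks only at `STORE_NAME` immediately followed by `LOAD_NAME` of the same name, and it assumes a Python recent enough to have `dis.get_instructions`. Exact bytecode differs between interpreter versions, so counts will vary.

```python
import dis
import types

def count_store_load_pairs(code):
    """Count STORE_NAME x immediately followed by LOAD_NAME x (hypothetical helper)."""
    pairs = 0
    instructions = list(dis.get_instructions(code))
    for prev, cur in zip(instructions, instructions[1:]):
        if (prev.opname == "STORE_NAME"
                and cur.opname == "LOAD_NAME"
                and prev.argval == cur.argval
                # A jump landing on the LOAD would make fusing the pair unsafe.
                and not cur.is_jump_target):
            pairs += 1
    # Recurse into nested code objects (functions, classes, comprehensions).
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            pairs += count_store_load_pairs(const)
    return pairs

module_code = compile("x=1; y=x*2", "?", "exec")
print(count_store_load_pairs(module_code))
```

Running the same function over the compiled source of a real module would give a crude per-module count, though it says nothing about whether those pairs sit on a hot path.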
Of course, I'm not sure how commonly this sort of code occurs in places
where it makes a difference to anything. Function call overhead continues
to be Python's most damaging performance issue, because it makes it
expensive to use abstraction.
Here's a thought. Suppose we split frames into an "object" part and a
"struct" part, with the object part being just a pointer to the struct
part, and a flag indicating whether the struct part is stack-allocated or
malloc'ed. This would let us stack-allocate the bulk of the frame
structure, but still have a frame "object" to pass around. On exit from
the C routine that stack-allocated the frame struct, we check to see if the
frame object has a refcount>1, and if so, malloc a permanent home for the
frame struct and update the frame object's struct pointer and flag.
In this way, frame allocation overhead could be reduced to the cost of an
alloca, or just incorporated into the stack frame setup of the C routine
itself, allowing the entire struct to be treated as "local variables" from
a C perspective (which might benefit performance on architectures that
reserve a register for local variable access).
Of course, this would slow down exception handling and other scenarios that
result in extra references to a frame object, but if the OS malloc is the
slow part of frame allocation (frame objects are too large for pymalloc),
then perhaps it would be a net win. On the other hand, this approach would
definitely use more stack space per calling level.
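The bookkeeping in the frame-splitting idea can be modelled in a few lines of Python. This is only a toy: Python cannot express stack allocation, so `FrameStruct`, `FrameObject`, `run_call` and the explicit `refcount` field are all invented stand-ins for what would be C structs, an `alloca`, and the real reference count. It shows just the control flow: allocate cheaply, and pay for a permanent copy only if the frame object outlives the call.

```python
import copy

class FrameStruct:
    """Stand-in for the bulk of a frame (locals, value stack, and so on)."""
    def __init__(self, name):
        self.name = name
        self.locals = {}

class FrameObject:
    """Small handle: a pointer to the struct plus a stack-or-heap flag."""
    def __init__(self, struct):
        self.struct = struct
        self.on_stack = True   # struct lives in the calling C routine's stack frame
        self.refcount = 1      # the reference held by the running call

def run_call(name, body):
    struct = FrameStruct(name)          # models alloca / C local variables
    frame = FrameObject(struct)
    try:
        return body(frame)
    finally:
        if frame.refcount > 1:
            # Someone kept the frame (a traceback, say): give the struct a
            # permanent home and update the object's pointer and flag.
            frame.struct = copy.copy(struct)   # models malloc + memcpy
            frame.on_stack = False
        frame.refcount -= 1             # drop the call's own reference

kept = []

def escaping(frame):
    frame.refcount += 1                 # models an INCREF by whoever keeps the frame
    kept.append(frame)
    return "done"

run_call("f", escaping)
print(kept[0].on_stack)
```

In the common case where nothing takes an extra reference, the `finally` block does no copying at all, which is where the hoped-for saving would come from.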