Python and the need for speed
python at lucidity.plus.com
Tue Apr 18 05:32:45 EDT 2017
On 13/04/17 18:50, MRAB wrote:
> On 2017-04-13 09:08, Steven D'Aprano wrote:
>> On Wed, 12 Apr 2017 16:30:38 -0700, bart4858 wrote:
>> Is it possible to skip the STORE_NAME op-code? If you knew *for sure*
>> that the target (x) was a mutable object which implemented += using an
>> in-place mutation, then you could, but the only built-in where that applies
>> is list so even if you could guarantee x was a list, it hardly seems
>> worth the bother.
> If the reference to be stored by STORE_NAME is the same as the reference
> returned by LOAD_NAME, then STORE_NAME could be omitted.
> That would just mean remembering that address.
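As a rough illustration of the condition being discussed (whether the reference that comes back from += is the one that was loaded), here is a minimal sketch. The helper name same_object_after_iadd is made up for this example and is not part of CPython:

```python
def same_object_after_iadd(value, increment):
    """Return True if `value += increment` hands back the very same object.

    Hypothetical helper for illustration only; the interpreter would have
    to make an equivalent identity check at run time before skipping a store.
    """
    before = value          # remember the reference that LOAD_NAME would push
    value += increment      # may mutate in place or build a new object
    return value is before  # identity, not equality


# list implements += as an in-place mutation and returns itself
list_case = same_object_after_iadd([0, 1, 2], [3])

# tuple and int are immutable, so += has to produce a new object
tuple_case = same_object_after_iadd((0, 1, 2), (3,))
int_case = same_object_after_iadd(1000, 1)

print(list_case, tuple_case, int_case)
```

Only in the first case would omitting the store be safe, and that can only be known after the in-place operation has run.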
When considering special-casing this opcode sequence, remember that
in-place operations can be performed on anonymous objects (i.e., those
referenced by a collection and not bound directly to a namespace):
>>> import dis
>>> dis.dis(compile("x = [0, 1, 2]; x[1] += 1;", "", "single"))
  1           0 LOAD_CONST               0 (0)
              3 LOAD_CONST               1 (1)
              6 LOAD_CONST               2 (2)
              9 BUILD_LIST               3
             12 STORE_NAME               0 (x)
             15 LOAD_NAME                0 (x)
             18 LOAD_CONST               1 (1)
             21 DUP_TOP_TWO
             22 BINARY_SUBSCR
             23 LOAD_CONST               1 (1)
             26 INPLACE_ADD
             27 ROT_THREE
             28 STORE_SUBSCR
             29 LOAD_CONST               3 (None)
             32 RETURN_VALUE
So in this case, the STORE_SUBSCR does the re-binding, but it is
separated from the INPLACE_ADD by another opcode.
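The same ordering can be observed from pure Python, without depending on the opcode names of any particular CPython version. This is only a sketch: the Item and Box classes and the events list are invented for the illustration.

```python
events = []  # records the order in which the special methods are called


class Item:
    """Hypothetical mutable element that implements += in place."""

    def __init__(self, value):
        self.value = value

    def __iadd__(self, other):
        events.append("iadd")
        self.value += other
        return self  # same object, mutated in place


class Box:
    """Hypothetical container that logs subscript loads and stores."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        events.append("getitem")
        return self._items[index]

    def __setitem__(self, index, value):
        events.append("setitem")
        self._items[index] = value


box = Box([Item(0), Item(1), Item(2)])
box[1] += 1  # the element is anonymous: only the container refers to it

print(events)
```

Even though Item mutates itself in place, the container's __setitem__ is still called afterwards, so any folded opcode would have to cover the subscript store as well as the plain name store.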
I'm not saying it's impossible to fold the re-binding into a (set of)
special new opcode(s), but I am saying it's more complex than it first
appears.
FWIW, I spent some time about a year ago looking at things like this
(small improvements to the peephole optimizer which allowed certain very
common sequences to be folded into a (new) opcode which in turn allowed
other optimizations to avoid branching). The changes worked, but didn't
actually improve performance significantly in my tests (which is why I
ended up not bothering to propose anything).
I remember back in the day (circa 1.5.2?) that
trips-around-the-interpreter-loop were significant and avoiding them
could give wins. However, in the current CPython interpreter, the
improvements over the original huge switch() to dispatch the bytecodes
to the correct handler appear to have made this type of optimization
less effective. That was my conclusion at the time, anyway - I only had
about a week to experiment with it.