Python and the need for speed

Tue Apr 18 05:32:45 EDT 2017

On 13/04/17 18:50, MRAB wrote:
> On 2017-04-13 09:08, Steven D'Aprano wrote:
>> On Wed, 12 Apr 2017 16:30:38 -0700, bart4858 wrote:
>> Is it possible to skip the STORE_NAME op-code? If you knew *for sure*
>> that the target (x) was a mutable object which implemented += using an
>> in-place mutation, then you could, but the only built-in where that
>> applies is list so even if you could guarantee x was a list, it hardly
>> seems worth the bother.
>>
> If the reference to be stored by STORE_NAME is the same as the reference
> returned by LOAD_NAME, then STORE_NAME could be omitted.
>
> That would just mean remembering that address.
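
For reference, here is a rough sketch of when that condition holds. The helper name iadd_keeps_identity is made up for illustration and is not anything in CPython. It only shows that whether the reference coming out of += is the same one that was loaded depends on the object's type, so a check like this would have to be made at run time:

```python
def iadd_keeps_identity(value, increment):
    """Return True if `value += increment` leaves the name bound to the same object.

    Hypothetical helper for illustration only.
    """
    before = value
    value += increment      # may mutate in place or build a new object
    return value is before


list_same = iadd_keeps_identity([0, 1], [2])     # list implements += in place
tuple_same = iadd_keeps_identity((0, 1), (2,))   # tuple builds a new tuple
int_same = iadd_keeps_identity(1000, 1)          # int builds a new int

print(list_same, tuple_same, int_same)
```

Only in the first case would the store be redundant.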

When considering special-casing this opcode sequence, remember that 
in-place operations can be performed on anonymous objects (i.e., those 
referenced by a collection and not bound directly to a namespace):

 >>> import dis
 >>> dis.dis(compile("x = [0, 1, 2]; x[1] += 1;", "", "single"))
   1           0 LOAD_CONST               0 (0)
               3 LOAD_CONST               1 (1)
               6 LOAD_CONST               2 (2)
               9 BUILD_LIST               3
              12 STORE_NAME               0 (x)
              15 LOAD_NAME                0 (x)
              18 LOAD_CONST               1 (1)
              21 DUP_TOP_TWO
              22 BINARY_SUBSCR
              23 LOAD_CONST               1 (1)
              26 INPLACE_ADD
              27 ROT_THREE
              28 STORE_SUBSCR
              29 LOAD_CONST               3 (None)
              32 RETURN_VALUE

So in this case, the STORE_SUBSCR does the re-binding, but it is 
separated from the INPLACE_ADD by another opcode (ROT_THREE).
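
The same load / in-place op / store sequence can be observed without reading bytecode, which helps because the exact opcode names vary between CPython versions. This is a minimal sketch: Traced is a hypothetical list subclass written only for this illustration, and it logs the subscript calls made on it.

```python
class Traced(list):
    """A list that logs subscript loads and stores (illustrative helper)."""

    def __init__(self, *args):
        super().__init__(*args)
        self.calls = []

    def __getitem__(self, index):
        self.calls.append("getitem")
        return super().__getitem__(index)

    def __setitem__(self, index, value):
        self.calls.append("setitem")
        super().__setitem__(index, value)


# Immutable element: the store is what makes the change visible.
x = Traced([0, 1, 2])
x[1] += 1
int_calls = list(x.calls)

# Mutable element: += mutates the inner list in place,
# yet the store back into the container still happens.
y = Traced([[0], [1]])
inner = y[1]
y.calls.clear()
y[1] += [5]
list_calls = list(y.calls)

print(int_calls, x)
print(list_calls, y, y[1] is inner)
```

In both cases the container sees a load followed by a store, so any folded opcode would have to cover the subscript form as well as the plain-name form.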

I'm not saying it's impossible to fold the re-binding into a (set of) 
special new opcode(s), but I am saying it's more complex than it first 
appears.

FWIW, I spent some time about a year ago looking at things like this: 
small improvements to the peephole optimizer that folded certain very 
common sequences into a new opcode, which in turn allowed other 
optimizations to avoid branching. The changes worked, but didn't 
improve performance significantly in my tests, which is why I ended up 
not proposing anything.

I remember back in the day (circa 1.5.2?) that 
trips-around-the-interpreter-loop were significant and avoiding them 
could give wins. However, in the current CPython interpreter, the 
improvements to bytecode dispatch over the original huge switch() 
appear to have made this type of optimization less effective. That was 
my conclusion at the time, anyway. I only had about a week to 
experiment with it.

E.