[Python-Dev] Micro-optimizations by adding special-case bytecodes?
Erik
python at lucidity.plus.com
Wed May 24 16:14:18 EDT 2017
Hi Ben,
On 24/05/17 19:07, Ben Hoyt wrote:
> I'm not proposing to do this yet, as I'd need to benchmark to see how
> much of a gain (if any) it would amount to, but I'm just wondering if
> there's any previous work on this kind of thing. Or, if not, any other
> thoughts before I try it?
This is exactly what I looked into just over a year ago. As Stephane
suggests, I did this by adding new opcodes that the peephole optimizer
generated and the interpreter loop understood (the compiler itself did
not need to know anything about the new opcodes, which made things much
easier).
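To make the division of labour concrete: the compiler emits perfectly
ordinary opcodes, and the optimizer just pattern-matches over the
finished bytecode. You can see the raw material it works with using the
dis module (plain Python, nothing here depends on my patches; exact
offsets and opargs vary by version):

    import dis

    def answer():
        return 42

    dis.dis(answer)
    # On 3.6 this prints something like:
    #   0 LOAD_CONST    1 (42)
    #   2 RETURN_VALUE

Sequences like that trailing LOAD_CONST/RETURN_VALUE pair are what the
peephole code can recognise and rewrite into one of the new opcodes.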
Adding new opcodes like this at the time wasn't straightforward because
of issues with the build process (see this thread:
https://mail.python.org/pipermail/python-dev/2015-December/142600.html -
it started out as a question about the bytecode format but ended up with
some very useful information on the build process).
Note that since that thread, a couple of things have changed - the
bytecode is now wordcode so some of my original questions aren't
relevant, and some of the things I had a problem with in the build
system are now auto-generated with a new 'make' target. So it _should_
be easier now than it was then.
In terms of the results I got once I had things building and running, I
didn't manage to find any particular magic bullets that gave a
significant enough speedup. Perhaps I just didn't pick the right opcode
sequences or the right test cases, though what I was trying to do worked
well mechanically - for example, collapsing branches-to-RETURN into a
single RETURN: a LOAD_CONST/RETURN_VALUE pair became RETURN_CONST, and
then if the target of an unconditional branch was a RETURN_CONST op, the
branch op itself could be replaced by that RETURN_CONST.
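As a sketch of the shape of that rewrite (the RETURN_CONST opcode number
and the fold below are hypothetical - the real change lives in C inside
the peephole optimizer; this is just Python over 3.6-style wordcode,
where every instruction is two bytes):

    import dis

    RETURN_CONST = 200                     # hypothetical, unused opcode number
    LOAD_CONST = dis.opmap['LOAD_CONST']
    RETURN_VALUE = dis.opmap['RETURN_VALUE']
    NOP = dis.opmap['NOP']

    def fold_return_const(co_code):
        # Rewrite LOAD_CONST x / RETURN_VALUE into RETURN_CONST x.
        # Sketch only: the result isn't runnable on a stock interpreter,
        # it just shows where the peephole pass does its work.
        out = bytearray(co_code)
        i = 0
        while i + 3 < len(out):
            if out[i] == LOAD_CONST and out[i + 2] == RETURN_VALUE:
                out[i] = RETURN_CONST      # oparg (the const index) stays put
                out[i + 2] = NOP           # pad so later jump targets keep
                out[i + 3] = 0             # their existing offsets
            i += 2
        return bytes(out)

    def f():
        return 42

    # roughly: a two-byte "RETURN_CONST" followed by NOP padding
    print(fold_return_const(f.__code__.co_code).hex())

The branch-to-RETURN_CONST redirection mentioned above is then just
another pattern over the same byte string.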
I figured that one thing every function or method needs to do is return,
so I tried to make that more efficient. I only had two weeks to spend on
it though ...
I was trying to do that by avoiding extra trips around the interpreter
loop, as that was historically something that gave speedups. However,
with the new computed-goto version of the interpreter I came to the
conclusion that it's not as important as it used to be. I was building
with gcc, though, and what I *didn't* do was disable the computed-goto
code (it's controlled by a #define) to see whether my changes improved
performance on platforms that can't use it.
I identified some other opcode sequences that might be worth looking at
further.
I didn't (and still don't) have enough bandwidth to *drive* something
like this through, but if you want to do that I'd be more than happy to
be kept in the loop on what you're doing, and I can possibly find time
to write some code too.
Regards, E.