[Python-ideas] Exposing flat bytecode representation to optimizers

Andrew Barnert abarnert at yahoo.com
Fri Feb 5 16:03:20 EST 2016


On Friday, February 5, 2016 12:36 PM, Serhiy Storchaka <storchaka at gmail.com> wrote:

> On 05.02.16 21:58, Andrew Barnert via Python-ideas wrote:
>>  The biggest pain with dealing with the peephole optimizer is that it 
>> happens after all the complicated flattening and fixup[^1] the compiler does, 
>> which means you have to hack up the jump targets as you go along. The future 
>> bytecode optimizers that PEP 511 enables will have the same headache.
>> 
>>  But this isn't actually necessary. The optimizer could work on a flat 
>> array of instructions[^2] instead of an array of bytes, with relocatable jump 
>> targets instead of fixed byte offsets, and then the compiler could do the fixup 
>> _after_ the optimization.[^3]
>
> LGTM. I did have the same idea when I added the specialized 8-bit opcodes.
> 
> Some optimizations (like constant folding) are worth moving yet one 
> step earlier, to the AST.

Definitely. And if we could only move _all_ optimization earlier, that would make this change unnecessary.

But, sadly, I don't think we can; there are some steps the peephole optimizer does that have to stay bytecode-based, and post-PEP 511, there will always be new ideas people come up with that also have to stay bytecode-based.[^1]

> Another idea: instead of EXTENDED_ARG, have two sets of instructions: 
> short 16-bit with an 8-bit arg, and long 32-bit with a 24-bit arg. For 
> simplicity, initially only long instructions are emitted (if it makes 
> sense). After optimization the code is packed using short instructions 
> where possible (now we have all the needed information and can do this 
> in one pass). At the same time NOPs are removed. For later stages we 
> can easily unpack the code back to long form.

I like that. Unpacking/packing is a lot simpler, and less to explain, than exposing an array of structs. We could even expose those functions to Python for bytecode decorators (maybe hidden in the dis module?); that would be far simpler than everything byteplay does, and 90% of the time it would be all you need.
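
Just to make that concrete, unpacking could be something like this (a minimal sketch; the name and the fixed-width encoding, 1 opcode byte plus a 24-bit little-endian arg per instruction, are purely illustrative, not a real dis API):

    def unpack(code):
        """Unpack fixed-width long-form bytecode into a list of
        (opcode, arg) pairs, one entry per instruction, so that
        optimizers can treat jump arguments as instruction indices
        instead of byte offsets.  Hypothetical encoding: 4 bytes
        per instruction, 1 opcode byte + 24-bit little-endian arg."""
        assert len(code) % 4 == 0
        return [(code[i], int.from_bytes(code[i+1:i+4], 'little'))
                for i in range(0, len(code), 4)]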

I also like that you can cram NOP removal into the packing step, so optimizers that just remove code don't have to actually remove or renumber anything; only those that insert or move code have to worry about that.
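
The packing side can then fold NOP removal and jump fixup into one pass over that list. A rough sketch, assuming jump args in the unpacked form are absolute instruction indices (relative jumps would have been converted during unpacking), and re-emitting the 4-byte long form rather than choosing short/long encodings:

    import dis

    def pack(instructions):
        """Pack (opcode, arg) pairs back to bytes, dropping NOPs
        and renumbering jump targets along the way."""
        nop = dis.opmap['NOP']
        # First pass: map each old instruction index to its index
        # after NOP removal.
        new_index, n = [], 0
        for op, arg in instructions:
            new_index.append(n)
            if op != nop:
                n += 1
        # Second pass: emit the surviving instructions, remapping
        # any absolute jump targets through that table.
        out = bytearray()
        for op, arg in instructions:
            if op == nop:
                continue
            if op in dis.hasjabs:
                arg = new_index[arg]
            out.append(op)
            out += arg.to_bytes(3, 'little')
        return bytes(out)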

The CPU cost of going from a list of arrays of instructions to an unpacked opcode array to a packed opcode array will be a little higher than going straight from the list of arrays to a packed opcode array, but I'll bet it will be negligible.

The only downside I can think of is that arguments are currently restricted to 2**31 (signed 32-bit), and this would restrict them to 2**23 (signed 24-bit). Has anyone ever had anything close to 4 million instructions (or consts, or whatever) in a code object? I doubt it.
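
For concreteness, that limit falls directly out of the short-vs-long choice during packing, which is just a bounds check per instruction. A hypothetical helper (the paired short/long opcodes are assumptions of the scheme, not real opcodes):

    def encode(op_short, op_long, arg):
        """Emit the 16-bit short form when the arg fits in 8 bits,
        otherwise the 32-bit long form with a 24-bit arg."""
        if arg <= 0xFF:
            return bytes([op_short, arg])
        if arg <= 0xFFFFFF:
            return bytes([op_long]) + arg.to_bytes(3, 'little')
        raise ValueError('arg %d does not fit in 24 bits' % arg)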


But why not make it even simpler and just have all unpacked instructions be 32-bit? Sure, that means unpacked code arrays are bigger, but it's not like the optimizers are going to be looping over the same array a zillion times and worrying about cache spill (or that the optimizers will be in a hotspot in the first place). Then we've just got an int32*, and a jump to offset 76 is a jump to the 4 bytes at bytecode[76] (or, in Python, where we may still have to use a bytes object, it's at worst a jump to bytecode[76<<2]).
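
In Python terms, fetching the instruction a jump points at then becomes pure index arithmetic. A sketch over a plain bytes object, using the same hypothetical 4-byte encoding as above:

    def instruction_at(bytecode, index):
        """With fixed 4-byte instructions, instruction i starts at
        byte offset i << 2, so a jump arg is just another
        instruction index."""
        offset = index << 2
        op = bytecode[offset]
        arg = int.from_bytes(bytecode[offset+1:offset+4], 'little')
        return op, arg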

Anyway, this seems doable, and worth doing, independent of the wordcode idea, the packed-args idea you're working on, AST-ifying most of the peephole optimizer, PEP 511, or anything else. So, if no one comes up with any objections, I'll add a tracker issue for it and work on this before going back to wordcode this weekend.


  [^1]: For example, Victor's const optimization can only be done at the AST level by adding a new kind of AST node and compiler support for it. That's obviously not something a plug-in optimizer could do, so it would instead have to be written as a bytecode optimizer.

