Exposing flat bytecode representation to optimizers
The biggest pain with dealing with the peephole optimizer is that it happens after all the complicated flattening and fixup[^1] the compiler does, which means you have to hack up the jump targets as you go along. The future bytecode optimizers that PEP 511 enables will have the same headache.

But this isn't actually necessary. The optimizer could work on a flat array of instructions[^2] instead of an array of bytes, with relocatable jump targets instead of fixed byte offsets, and then the compiler could do the fixup _after_ the optimization.[^3]

It would break the optimizer APIs, but `PyCode_Optimize` isn't public, and while the API proposed by PEP 511 is public, that PEP isn't even finalized, much less accepted yet.

I don't think we need to expose the intermediate representation any farther along than the `PyCode_Optimize` step.[^4] Just moving the optimizer one step earlier in the chain solves more than enough to be worth it.

[^1]: If you think numbering the offsets and emitting the jump targets is easy: Every time you fix up a jump, that may require adding an `EXTENDED_ARG`, which means you have to redo any fixups that cross the current instruction. The compiler does this by just looping until it hasn't added any more `EXTENDED_ARG`s.

[^2]: In case anyone's thinking that wordcode would solve this problem, it doesn't. The `EXTENDED_ARG` jump targets are a much bigger hassle than the 1-or-3-byte ops, and wordcode not only can't eliminate those, but makes `EXTENDED_ARG` more common.

[^3]: The compiler doesn't actually have exactly what the optimizers would want, but it's pretty close: it has a linked list of block objects, each of which has an array of instruction objects, with jump targets being pointers to blocks. That's actually even better to work with, but too complicated to expose to optimizers. Flattening it would be trivial. Or, if that's too expensive, we could do something almost as simple and much cheaper: convert it in-line to a deque-like linked list of arrays, with jump targets being indices or pointers into that. Or we could just expose the list of blocks as-is, as an opaque thing with a mutable-deque-of-instructions API around it.

[^4]: Later stages--import hooks, optimizing decorators, etc.--have the same pain as the peephole optimizer, but they also have code objects that contain other code objects, and they can be written in Python, and so on, so the same solution isn't feasible there. Of course we could add a function to convert bytecode back to a list of instructions, in some format that can be exposed to Python (in fact, `dis` already does 90% of that), and then another one to convert that back to bytecode and redo the fixup (which is basically the third-party `byteplay` module). But that's almost certainly overkill. (If we wanted that, I'd rather add `byteplay` to the stdlib, port it and `dis` to C, and expose a C API for them. And then we could use that for everything, including the peephole optimizer and PEP 511 optimizers. Which would all be really cool, but I think it's more work than we want to do, and I don't know if we'd actually want something like `byteplay` builtin even if it were easy...)
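To make that concrete, here is a minimal sketch (in Python, with a made-up `(opcode, arg)` instruction representation; none of this is CPython's actual code) of fixup-after-optimization: jump arguments stay instruction indices while the optimizers run, and a single assembly pass at the end does the `EXTENDED_ARG` loop from footnote 1:

```python
import dis

def insn_size(arg):
    # 3.5-era encoding: 1 byte for a no-arg op, 3 bytes for an op with a
    # 16-bit arg, plus one 3-byte EXTENDED_ARG prefix per extra 16 bits.
    if arg is None:
        return 1
    size = 3
    while arg > 0xFFFF:
        arg >>= 16
        size += 3
    return size

def compute_offsets(instructions):
    """instructions: list of (opcode, arg), where the arg of an absolute-jump
    opcode is an instruction index (relative jumps omitted for brevity)."""
    # Growing one jump's arg past 16 bits inserts an EXTENDED_ARG, which
    # shifts every later offset, so iterate to a fixed point -- the same loop
    # footnote 1 describes, but run once, after all optimizers have finished.
    offsets = [0] * len(instructions)
    while True:
        new_offsets, pos = [], 0
        for op, arg in instructions:
            new_offsets.append(pos)
            if op in dis.hasjabs:
                arg = offsets[arg]      # index -> current guess at byte offset
            pos += insn_size(arg)
        if new_offsets == offsets:
            return offsets              # stable: safe to emit the bytes now
        offsets = new_offsets
```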
On 05.02.16 21:58, Andrew Barnert via Python-ideas wrote:
The biggest pain with dealing with the peephole optimizer is that it happens after all the complicated flattening and fixup[^1] the compiler does, which means you have to hack up the jump targets as you go along. The future bytecode optimizers that PEP 511 enables will have the same headache.
But this isn't actually necessary. The optimizer could work on a flat array of instructions[^2] instead of an array of bytes, with relocatable jump targets instead of fixed byte offsets, and then the compiler could do the fixup _after_ the optimization.[^3]
It would break the optimizer APIs, but `PyCode_Optimize` isn't public, and while the API proposed by PEP 511 is public, that PEP isn't even finalized, much less accepted yet.
I don't think we need to expose the intermediate representation any farther along than the `PyCode_Optimize` step.[^4] Just moving the optimizer one step earlier in the chain solves more than enough to be worth it.
LGTM. I did have the same idea when I added specialized 8-bit opcodes.

Some optimizations (like constant folding) are worth moving yet one step earlier, to the AST.

Other idea: instead of EXTENDED_ARG, have two sets of instructions: short 16-bit with an 8-bit arg, and long 32-bit with a 24-bit arg. For simplicity, initially only long instructions are emitted (if that makes sense). After optimization, the code is packed using short instructions where possible (at that point we have all the needed information and can do it in one pass). At the same time, NOPs are removed. For later stages we can easily unpack the code back to the long form.
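A rough sketch of that pack step, under some assumptions: the long form is executable with four bytes per instruction and jump args as long-form byte offsets, and a flag bit stands in for the separate short/long opcode sets actually described above:

```python
NOP, JUMP = 9, 113   # NOP and JUMP_ABSOLUTE numbers borrowed for illustration

def pack(code):
    """code: list of (op, arg) pairs in the long form (4 bytes/instruction,
    arg always present, 0 if unused); jump args are long-form byte offsets."""
    # Pass 1: size each surviving instruction and map old offsets to new.
    # Deciding short vs. long from the pre-pack arg is safe for jumps:
    # packing only shrinks offsets, so a target that fit still fits.
    new_offset, sizes, pos = {}, [], 0
    for i, (op, arg) in enumerate(code):
        new_offset[4 * i] = pos
        size = 0 if op == NOP else (2 if arg < 0x100 else 4)
        sizes.append(size)
        pos += size
    # Pass 2: emit, translating jump targets through the map. NOPs simply
    # disappear, so a NOP-ing optimizer never renumbers anything itself.
    out = bytearray()
    for (op, arg), size in zip(code, sizes):
        if size == 0:
            continue
        if op == JUMP:
            arg = new_offset[arg]
        if size == 2:
            out += bytes([op, arg])
        else:
            # High bit marks the long form here; args are at most 24 bits.
            out += bytes([op | 0x80]) + arg.to_bytes(3, 'little')
    return bytes(out)
```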
On Friday, February 5, 2016 12:36 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 05.02.16 21:58, Andrew Barnert via Python-ideas wrote:
The biggest pain with dealing with the peephole optimizer is that it happens after all the complicated flattening and fixup[^1] the compiler does, which means you have to hack up the jump targets as you go along. The future bytecode optimizers that PEP 511 enables will have the same headache.
But this isn't actually necessary. The optimizer could work on a flat array of instructions[^2] instead of an array of bytes, with relocatable jump targets instead of fixed byte offsets, and then the compiler could do the fixup _after_ the optimization.[^3]
LGTM. I did have the same idea when I added specialized 8-bit opcodes.
Some optimizations (like constant folding) are worth moving yet one step earlier, to the AST.
Definitely. And if we could only move _all_ optimization earlier, that would make this change unnecessary. But, sadly, I don't think we can; there are some steps the peephole optimizer does that have to stay bytecode-based, and post-PEP 511, there will always be new ideas people come up with that also have to stay bytecode-based.[^1]
Other idea: instead of EXTENDED_ARG, have two sets of instructions: short 16-bit with an 8-bit arg, and long 32-bit with a 24-bit arg. For simplicity, initially only long instructions are emitted (if that makes sense). After optimization, the code is packed using short instructions where possible (at that point we have all the needed information and can do it in one pass). At the same time, NOPs are removed. For later stages we can easily unpack the code back to the long form.
I like that. Unpacking/packing is a lot simpler, and less to explain, than exposing an array of structs, and we can even expose those functions to Python for bytecode decorators (maybe hidden in the dis module?), which would be a lot simpler than everything byteplay does, and 90% of the time it would be all you need.

I also like that you can cram NOP removal into the packing step, so optimizers that just remove code don't have to actually remove or renumber anything; only those that insert or move code have to worry about that.

The CPU cost of going from list of arrays of instructions to unpacked opcode array to packed opcode array will be a little higher than going straight from list of arrays of instructions to packed opcode array, but I'll bet it will be negligible.

The only downside I can think of is that arguments are now restricted to 2**31 (or signed 32-bit), and this would restrict them to 2**23. Has anyone ever had anything close to 4 million instructions (or consts, or whatever) in a code object? I doubt it.

But why not make it even simpler and just have all unpacked instructions be 32-bit? Sure, that means unpacked code arrays are bigger, but it's not like the optimizers are going to be looping over the same array a zillion times and worrying about cache spill (or that the optimizers will be in a hotspot in the first place). Then we've just got an int32*, and a jump to offset 76 is a jump to the 4 bytes at bytecode[76] (or, in Python, where we may still have to use a bytes object, it's at worst a jump to bytecode[76<<2]).

Anyway, this seems doable, and worth doing, independent of the wordcode idea, the packed-args idea you're working on, AST-ifying most of the peephole optimizer, PEP 511, or anything else, so if no one comes up with any objections, I'll add a tracker issue for it and work on this before going back to wordcode this weekend.

[^1]: For example, Victor's const optimization can only be done at the AST level by adding a new kind of AST node, and compiler support for it. That's obviously not something a plug-in optimizer could do, so it would instead have to be written as a bytecode optimizer.
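For example, a delete-only pass against that unpacked form could be as simple as this sketch (`code` as a mutable list of `(op, arg)` pairs is an assumed representation, not an existing API). It blanks instructions to NOP and never touches a jump target or the lnotab, leaving all renumbering to the pack step:

```python
from opcode import opmap

LOAD_CONST, POP_TOP, NOP = opmap['LOAD_CONST'], opmap['POP_TOP'], opmap['NOP']

def drop_unused_consts(code):
    """Blank out "LOAD_CONST x; POP_TOP" pairs, which have no effect.
    (A real pass would also check that no jump targets the POP_TOP.)"""
    for i in range(len(code) - 1):
        if code[i][0] == LOAD_CONST and code[i + 1][0] == POP_TOP:
            code[i] = code[i + 1] = (NOP, 0)   # no renumbering needed
    return code
```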
On 05.02.16 23:03, Andrew Barnert via Python-ideas wrote:
But why not make it even simpler and just have all unpacked instructions be 32-bit? Sure, that means unpacked code arrays are bigger, but it's not like the optimizers are going to be looping over the same array a zillion times and worrying about cache spill (or that the optimizers will be in a hotspot in the first place). Then we've just got an int32*, and a jump to offset 76 is a jump to the 4 bytes at bytecode[76] (or, in Python, where we may still have to use a bytes object, it's at worst a jump to bytecode[76<<2]).
My idea was to not add new opcodes for the unpacked form and to keep the unpacked form executable. Thus we have 16-bit LOAD_CONST and 32-bit LONG_LOAD_CONST, but only 16-bit POP_TOP and COMPARE_OP, since POP_TOP has no argument and the argument of COMPARE_OP always fits in 8 bits. The unpacked form always uses the long variant if it exists.

Alternative variant: always use 32-bit instructions and don't pack them to 8 or 16 bits. This will increase bytecode size by 4/2.73 = 1.5 times, but will make some parts of the compiler, optimizer, and interpreter simpler.
On 2016-02-06 11:05:06, "Serhiy Storchaka" <storchaka@gmail.com> wrote:
On 05.02.16 23:03, Andrew Barnert via Python-ideas wrote:
But why not make it even simpler and just have all unpacked instructions be 32-bit? Sure, that means unpacked code arrays are bigger, but it's not like the optimizers are going to be looping over the same array a zillion times and worrying about cache spill (or that the optimizers will be in a hotspot in the first place). Then we've just got an int32*, and a jump to offset 76 is a jump to the 4 bytes at bytecode[76] (or, in Python, where we may still have to use a bytes object, it's at worst a jump to bytecode[76<<2]).
My idea was to not add new opcodes for the unpacked form and to keep the unpacked form executable. Thus we have 16-bit LOAD_CONST and 32-bit LONG_LOAD_CONST, but only 16-bit POP_TOP and COMPARE_OP, since POP_TOP has no argument and the argument of COMPARE_OP always fits in 8 bits. The unpacked form always uses the long variant if it exists.
Alternative variant: always use 32-bit instructions and don't pack them to 8 or 16 bits. This will increase bytecode size by 4/2.73 = 1.5 times, but will make some parts of the compiler, optimizer, and interpreter simpler.
If, at some point, we find that 32 bits aren't enough (because we're starting to get code objects containing more than 4 million instructions), we could then add a 'wide' form with 64-bit instructions. Just a thought...
On 2016-02-05 3:35 PM, Serhiy Storchaka wrote:
I don't think we need to expose the intermediate representation any farther along than the `PyCode_Optimize` step.[^4] Just moving the optimizer one step earlier in the chain solves more than enough to be worth it.
LGTM. I did have the same idea when I added specialized 8-bit opcodes.
Agree.
Some optimizations (like constant folding) are worth moving yet one step earlier, to the AST.
Other idea: instead of EXTENDED_ARG, have two sets of instructions: short 16-bit with an 8-bit arg, and long 32-bit with a 24-bit arg. For simplicity, initially only long instructions are emitted (if that makes sense). After optimization, the code is packed using short instructions where possible (at that point we have all the needed information and can do it in one pass). At the same time, NOPs are removed. For later stages we can easily unpack the code back to the long form.
I also had this idea (don't know if it's good or not):

1. have 16 bits per opcode
2. first 8 bits encode the opcode number
3. next 7 bits encode the arg (most args don't need more than 7 bits anyways)
4. if the 16th bit is 1, then the next opcode is the EXTENDED_ARG

We can then encode EXTENDED_ARG to follow the same format -- first 15 bits are for the arg, the last one is to add one more EXTENDED_ARG.

Yury
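If I'm reading the layout right, encoding and decoding would look something like this sketch (the exact bit positions are assumptions from the list above: low 8 bits opcode, next 7 bits arg, high bit "more arg follows"; extension words carry 15 arg bits plus their own continuation bit):

```python
def encode(op, arg):
    """Encode one instruction as a list of 16-bit words."""
    words = [(op & 0xFF) | ((arg & 0x7F) << 8)]
    arg >>= 7
    while arg:
        words[-1] |= 0x8000             # set the continuation bit
        words.append(arg & 0x7FFF)      # EXTENDED_ARG word: 15 more arg bits
        arg >>= 15
    return words

def decode(words, i):
    """Decode the instruction at word index i; return (op, arg, next_i)."""
    op = words[i] & 0xFF
    arg = (words[i] >> 8) & 0x7F
    shift = 7
    while words[i] & 0x8000:            # continuation bit set: consume next word
        i += 1
        arg |= (words[i] & 0x7FFF) << shift
        shift += 15
    return op, arg, i + 1

# Round trip: arg 300 needs one EXTENDED_ARG word (300 >= 2**7).
assert decode(encode(100, 300), 0) == (100, 300, 2)
```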
On Friday, February 5, 2016 2:26 PM, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
I also had this idea (don't know if it's good or not):
1. have 16 bits per opcode
2. first 8 bits encode the opcode number
3. next 7 bits encode the arg (most args don't need more than 7 bits anyways)
4. if the 16th bit is 1, then the next opcode is the EXTENDED_ARG
Why? The advantage of Serhiy's idea (or my original one) is that it makes it much easier to write bytecode processors (like the peephole optimizer) when everything is fixed width, because jump target and lnotab fixups become trivial. Your idea leaves jump target and lnotab fixups just as hard as they are today, so we get the cost of having to transform back and forth between two representations, without any benefit.

If you were instead suggesting that as an alternative to the version of wordcode I proposed and prototyped in the other thread, to be actually used by the interpreter, then there _is_ a small benefit to your version: arguments in range [2**17, 2**23) need 4 bytes instead of 6. But I think the cost of having arguments in range [2**7, 2**8) take 4 bytes instead of 2, and the extra complexity and CPU cost in the fetch and peek at the core of the eval loop, and being a bigger change from today, make it a lot less attractive.
On 2016-02-05 6:14 PM, Andrew Barnert wrote:
On Friday, February 5, 2016 2:26 PM, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
I also had this idea (don't know if it's good or not):
1. have 16 bits per opcode
2. first 8 bits encode the opcode number
3. next 7 bits encode the arg (most args don't need more than 7 bits anyways)
4. if the 16th bit is 1, then the next opcode is the EXTENDED_ARG
Why? The advantage of Serhiy's idea (or my original one) is that it makes it much easier to write bytecode processors (like the peephole optimizer) when everything is fixed width, because jump target and lnotab fixups become trivial. Your idea leaves jump target and lnotab fixups just as hard as they are today, so we get the cost of having to transform back and forth between two representations, without any benefit.
Yes, this was related to Serhiy's original proposal on how we can pack opcodes. It's completely unrelated to the peephole optimizer. Sorry for the confusion.
If you were instead suggesting that as an alternative to the version of wordcode I proposed and prototyped in the other thread, to be actually used by the interpreter, then there _is_ a small benefit to your version: arguments in range [2**17, 2**23) need 4 bytes instead of 6. But I think the cost of having arguments in range [2**7, 2**8) take 4 bytes instead of 2, and the extra complexity and CPU cost in the fetch and peek at the core of the eval loop, and being a bigger change from today, make it a lot less attractive.
I'm sorry, I can't parse what you said there...

My idea is based on the fact that most opcode arguments are in the [0, 127] range. EXTENDED_ARG isn't used too often, but when it is used, it will be packed more efficiently in my scheme.

Yury
On Feb 5, 2016, at 15:25, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
On 2016-02-05 6:14 PM, Andrew Barnert wrote:
On Friday, February 5, 2016 2:26 PM, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
I also had this idea (don't know if it's good or not):
1. have 16 bits per opcode
2. first 8 bits encode the opcode number
3. next 7 bits encode the arg (most args don't need more than 7 bits anyways)
4. if the 16th bit is 1, then the next opcode is the EXTENDED_ARG
Why? The advantage of Serhiy's idea (or my original one) is that it makes it much easier to write bytecode processors (like the peephole optimizer) when everything is fixed width, because jump target and lnotab fixups become trivial. Your idea leaves jump target and lnotab fixups just as hard as they are today, so we get the cost of having to transform back and forth between two representations, without any benefit.
Yes, this was related to Serhiy's original proposal on how we can pack opcodes. It's completely unrelated to the peephole optimizer. Sorry for the confusion.
OK, we're on the wrong thread, then, but no big deal.
If you were instead suggesting that as an alternative to the version of wordcode I proposed and prototyped in the other thread, to be actually used by the interpreter, then there _is_ a small benefit to your version: arguments in range [2**17, 2**23) need 4 bytes instead of 6. But I think the cost of having arguments in range [2**7, 2**8) take 4 bytes instead of 2, and the extra complexity and CPU cost in the fetch and peek at the core of the eval loop, and being a bigger change from today, make it a lot less attractive.
I'm sorry, I can't parse what you said there...
OK, let's compare my idea, from the thread you meant to reply to, and your idea.

I use 16-bit opcodes, with 8 bits for the op and 8 for the argument. I use the existing EXTENDED_ARG mechanism to handle values that don't fit in 8 bits (each 16-bit EXTENDED_ARG gets you another 8 bits).

You use 16-bit opcodes, with only 8 bits for the op, 7 for the argument, and 1 for a continuation bit. By using the continuation bit instead of existing EXTENDED_ARGs, you can fit 22 bits worth of argument in 4 bytes, while I can only fit 16 bits worth.

So, your idea uses fewer bytes than mine (4 instead of 6) for any argument in range [65536, 4194303]. But it uses more bytes than mine (4 instead of 2) for any argument in range [128, 255]. Although [128, 255] is a smaller range than [65536, 4194303], it's much more common in arguments, so I think your idea will save less space than mine overall.

Your idea is also more complicated conceptually, would require more changes to the interpreter (and to third-party code like byteplay or decompyle), and will probably be slower in the eval loop, which also all make it less attractive.

As for the idea Serhiy is working on, I suspect it will save more space than either yours or mine, but mine might have performance and/or simplicity benefits that make it a better choice. (Hopefully we'll both get far enough that we can prototype it and see.) He's stealing extra bits from the opcode byte to allow skipping the argument entirely in many cases (e.g., LOAD_CONST_0 would be a no-argument opcode that does the same thing as LOAD_CONST with argument 0). So many common opcodes will be only a single byte.
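Those byte counts are easy to sanity-check with a quick sketch; both functions below just model the two encodings as described in this message, they're not real code from anywhere:

```python
def size_extended_arg(arg):
    # My scheme: one 16-bit word per 8 bits of argument (via EXTENDED_ARG).
    words = 1
    while arg > 0xFF:
        arg >>= 8
        words += 1
    return 2 * words

def size_continuation(arg):
    # Yury's scheme: 7 bits in the first word, 15 bits per extra word.
    words, arg = 1, arg >> 7
    while arg:
        words += 1
        arg >>= 15
    return 2 * words

for arg in (127, 255, 65535, 65536, 4194303):
    print(arg, size_extended_arg(arg), size_continuation(arg))
# 127 -> 2 vs 2; 255 -> 2 vs 4; 65535 -> 4 vs 4;
# 65536 and 4194303 -> 6 vs 4, matching the ranges above.
```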
On 2016-02-05 6:58 PM, Andrew Barnert wrote:
On Feb 5, 2016, at 15:25, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
On 2016-02-05 6:14 PM, Andrew Barnert wrote:
On Friday, February 5, 2016 2:26 PM, Yury Selivanov <yselivanov.ml@gmail.com> wrote:
I also had this idea (don't know if it's good or not):
1. have 16 bits per opcode
2. first 8 bits encode the opcode number
3. next 7 bits encode the arg (most args don't need more than 7 bits anyways)
4. if the 16th bit is 1, then the next opcode is the EXTENDED_ARG

Why? The advantage of Serhiy's idea (or my original one) is that it makes it much easier to write bytecode processors (like the peephole optimizer) when everything is fixed width, because jump target and lnotab fixups become trivial. Your idea leaves jump target and lnotab fixups just as hard as they are today, so we get the cost of having to transform back and forth between two representations, without any benefit.

Yes, this was related to Serhiy's original proposal on how we can pack opcodes. It's completely unrelated to the peephole optimizer. Sorry for the confusion.
OK, we're on the wrong thread, then, but no big deal.
If you were instead suggesting that as an alternative to the version of wordcode I proposed and prototyped in the other thread, to be actually used by the interpreter, then there _is_ a small benefit to your version: arguments in range [2**17, 2**23) need 4 bytes instead of 6. But I think the cost of having arguments in range [2**7, 2**8) take 4 bytes instead of 2, and the extra complexity and CPU cost in the fetch and peek at the core of the eval loop, and being a bigger change from today, make it a lot less attractive.

I'm sorry, I can't parse what you said there...

OK, let's compare my idea, from the thread you meant to reply to, and your idea.
I use 16-bit opcodes, with 8 bits for the op and 8 for the argument. I use the existing EXTENDED_ARG mechanism to handle values that don't fit in 8 bits (each 16-bit EXTENDED_ARG gets you another 8 bits).
You use 16-bit opcodes, with only 8 bits for the op, 7 for the argument, and 1 for a continuation bit. By using the continuation bit instead of existing EXTENDED_ARGs, you can fit 22 bits worth of argument in 4 bytes, while I can only fit 16 bits worth.
So, your idea uses fewer bytes than mine (4 instead of 6) for any argument in range [65536, 4194303]. But it uses more bytes than mine (4 instead of 2) for any argument in range [128, 255]. Although [128, 255] is a smaller range than [65536, 4194303], it's much more common in arguments, so I think your idea will save less space than mine overall.
Your idea is also more complicated conceptually, would require more changes to the interpreter (and to third-party code like byteplay or decompyle), and will probably be slower in the eval loop, which also all make it less attractive.
Got it. Thanks for the clarification. I actually thought Serhiy was going to work on 16-bit opcodes, but it seems that he wants to add specialized opcodes like LOAD_ATTR0, LOAD_ATTR1, etc.

Yury
2016-02-05 21:35 GMT+01:00 Serhiy Storchaka <storchaka@gmail.com>:
LGTM. I did have the same idea when I added specialized 8-bit opcodes.
Some optimizations (like constant folding) are worth moving yet one step earlier, to the AST.
Indeed: in WPython I moved all constant folding to the ASDL.
Other idea: instead of EXTENDED_ARG, have two sets of instructions: short 16-bit with an 8-bit arg, and long 32-bit with a 24-bit arg. For simplicity, initially only long instructions are emitted (if that makes sense).
Don't do it with all instructions, but only with the jump ones, which are the only ones susceptible to such changes. In all other cases you already know the size of the argument, which will not change.

Regards, Cesare
On 2016-02-05 19:58:58, "Andrew Barnert via Python-ideas" <python-ideas@python.org> wrote:
The biggest pain with dealing with the peephole optimizer is that it happens after all the complicated flattening and fixup[^1] the compiler does, which means you have to hack up the jump targets as you go along. The future bytecode optimizers that PEP 511 enables will have the same headache.
But this isn't actually necessary. The optimizer could work on a flat array of instructions[^2] instead of an array of bytes, with relocatable jump targets instead of fixed byte offsets, and then the compiler could do the fixup _after_ the optimization.[^3]
It would break the optimizer APIs, but `PyCode_Optimize` isn't public, and while the API proposed by PEP 511 is public, that PEP isn't even finalized, much less accepted yet.
I don't think we need to expose the intermediate representation any farther along than the `PyCode_Optimize` step.[^4] Just moving the optimizer one step earlier in the chain solves more than enough to be worth it.
[^1]: If you think numbering the offsets and emitting the jump targets is easy: Every time you fix up a jump, that may require adding an `EXTENDED_ARG`, which means you have to redo any fixups that cross the current instruction. The compiler does this by just looping until it hasn't added any more `EXTENDED_ARG`s.
[^2]: In case anyone's thinking that wordcode would solve this problem, it doesn't. The `EXTENDED_ARG` jump targets are a much bigger hassle than the 1-or-3-byte ops, and wordcode not only can't eliminate those, but makes `EXTENDED_ARG` more common.
[^3]: The compiler doesn't actually have exactly what the optimizers would want, but it's pretty close: it has a linked list of block objects, each of which has an array of instruction objects, with jump targets being pointers to blocks. That's actually even better to work with, but too complicated to expose to optimizers. Flattening it would be trivial. Or, if that's too expensive, we could do something almost as simple and much cheaper: convert it in-line to a deque-like linked list of arrays, with jump targets being indices or pointers into that. Or we could just expose the list of blocks as-is, as an opaque thing with a mutable-deque-of-instructions API around it.
[^4]: Later stages--import hooks, optimizing decorators, etc.--have the same pain as the peephole optimizer, but they also have code objects that contain other code objects, and they can be written in Python, and so on, so the same solution isn't feasible there. Of course we could add a function to convert bytecode back to a list of instructions, in some format that can be exposed to Python (in fact, `dis` already does 90% of that), and then another one to convert that back to bytecode and redo the fixup (which is basically the third-party `byteplay` module). But that's almost certainly overkill. (If we wanted that, I'd rather add `byteplay` to the stdlib, port it and `dis` to C, and expose a C API for them. And then we could use that for everything, including the peephole optimizer and PEP 511 optimizers. Which would all be really cool, but I think it's more work than we want to do, and I don't know if we'd actually want something like `byteplay` builtin even if it were easy...)
What I've done in similar situations is have jumps refer to labels, which are also instructions in the code. There would then be a pass to copy the code, recording the position of each label, but not copying it, followed by a pass to change the jumps to refer to their target's position. (The jumps were fixed-size, though.)
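A minimal sketch of that scheme (made-up pseudo-instructions; jumps are fixed-size, as in the parenthetical, so positions never shift after the label pass):

```python
LABEL, JUMP = 'label', 'jump'   # stand-in opcodes for the sketch

def resolve_labels(code):
    """code: list of (op, arg); ('label', name) marks a jump target in the
    instruction stream, and ('jump', name) refers to one."""
    # Pass 1: copy the code, recording each label's position but not
    # copying the label itself.
    positions, out = {}, []
    for op, arg in code:
        if op == LABEL:
            positions[arg] = len(out)
        else:
            out.append((op, arg))
    # Pass 2: rewrite jumps to refer to their target's position.
    return [(op, positions[arg] if op == JUMP else arg)
            for op, arg in out]

# Example: a backward jump to the top of a loop body.
print(resolve_labels([(LABEL, 'top'), ('nop', None), (JUMP, 'top')]))
# -> [('nop', None), ('jump', 0)]
```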
On Friday, February 5, 2016 1:10 PM, MRAB <python@mrabarnett.plus.com> wrote:
On 2016-02-05 19:58:58, "Andrew Barnert via Python-ideas" <python-ideas@python.org> wrote:
The biggest pain with dealing with the peephole optimizer is that it happens after all the complicated flattening and fixup[^1] the compiler does, which means you have to hack up the jump targets as you go along. The future bytecode optimizers that PEP 511 enables will have the same headache.
But this isn't actually necessary. The optimizer could work on a flat array of instructions[^2] instead of an array of bytes, with relocatable jump targets instead of fixed byte offsets, and then the compiler could do the fixup _after_ the optimization.[^3]

What I've done in similar situations is have jumps refer to labels, which are also instructions in the code.
There would then be a pass to copy the code, recording the position of each label, but not copying it, followed by a pass to change the jumps to refer to their target's position. (The jumps were fixed-size, though.)
Yes, that's basically what byteplay does, and effectively what the compiler is doing by pointing at blocks instead of instructions (because Python has no way to jump to anything that isn't the start or end of a block).

But that still doesn't make the fixups trivial, unless either (a) you never insert, delete, or move code, so the jumps never change size, or (b) you're willing to punt and revert optimizing any time you push an arg into EXTENDED_ARG territory (which is reasonable for experimentation, but not for something in the core, or that you intend to share and deploy widely).

Making each optimizer do the same work seems kind of silly. Many of them will punt on uncommon (but not _that_ rare) cases--or, worse, will get those cases wrong and never test them. Plus, with PEP 511, it means if you have 6 optimizers plugged in, you're doing these extra passes 6 times instead of once.

We _could_ expose a higher-level label-based representation to the optimizers rather than making them all do it themselves. However, I think the unpack/pack that Serhiy suggested is simpler, and good enough. You _do_ still need fixups if you insert or move code (but not if you delete code, since his suggested pack does NOP removal), but those fixups will be so easy that you won't need labels unless you're doing some really complicated reordering.
2016-02-05 20:58 GMT+01:00 Andrew Barnert via Python-ideas <python-ideas@python.org>:
It would break the optimizer APIs, but `PyCode_Optimize` isn't public
Sadly, PyCode_Optimize() is public and part of the stable ABI. But I'm OK with taking the responsibility of breaking it if we implement PEP 511, since we will probably have a much better API to optimize AST and bytecode, and I would prefer that no one directly uses PyCode_Optimize()!
and while the API proposed by PEP 511 is public, that PEP isn't even finalized, much less accepted yet.
To be clear: PEP 511 is a draft still stuck in python-ideas; it's not ready for review on python-dev. I still expect changes to the API.

I'm very interested in your feedback on the bytecode transformer API :-)
[^3]: The compiler doesn't actually have exactly what the optimizers would want, but it's pretty close: it has a linked list of block objects, each of which has an array of instruction objects, with jump targets being pointers to blocks. That's actually even better to work with, but too complicated to expose to optimizers.
The current implementation of the peephole optimizer has complex code to update jumps. IMHO it would be simpler if the code were blocks of instructions.

My plan is to implement most optimizations at the AST level and only optimize jumps at the "bytecode" level. So I don't need much information on instructions, just a structure that is easy to manipulate to optimize jumps.

For my old "registervm" project, I also wrote my own API to have a higher-level structure for bytecode: https://hg.python.org/sandbox/registervm/file/tip/Lib/registervm.py

* Block: list of sequential instructions. Blocks are linked together by jumps (and conditional jumps)
* Instruction
* Instruction argument: Immediate, Immediate8 (8 bits), Label (for jumps), Register (well, something specific to my register-based instructions), Local (local variable)

Basically, the code is a list of blocks, where a block is a list of instructions + arguments (in registervm, an instruction can have many arguments, not only one). In registervm, the Instruction class stores arguments directly, but that's more an implementation detail ;-)

Do you know if byteplay and other projects have a similar design? It would be nice to find the root common structure to be a generic API.
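In outline, that structure amounts to something like the following sketch; the class names follow the list above, but the details are reconstructed rather than copied from registervm.py:

```python
class Label:
    """Names a block so jump instructions can reference it."""
    def __init__(self, name):
        self.name = name

class Instruction:
    """An opcode name plus arguments (Immediate, Label, Register, Local, ...)."""
    def __init__(self, name, *args):
        self.name, self.args = name, args

class Block:
    """A straight-line run of instructions; only the last one may jump."""
    def __init__(self, label):
        self.label = label
        self.instructions = []

# A code object is then just a list of blocks, linked by Label arguments:
entry, done = Block(Label('entry')), Block(Label('done'))
entry.instructions.append(Instruction('JUMP_IF_FALSE', done.label))
code = [entry, done]
```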
Flattening it would be trivial. Or, if that's too expensive, we could do something almost as simple and much cheaper: convert it in-line to a deque-like linked list of arrays, with jump targets being indices or pointers into that. Or we could just expose the list of blocks as-is, as an opaque thing with a mutable-deque-of-instructions API around it.
For the PEP 511 we can imagine multiple levels for code transformers: block of instructions, flattened bytecode, AST, etc. The cost of the transformation is converting from internal CPython objects to Python objects, and then converting the result of the code transformer back to internal CPython objects. For AST, I expect the cost of the double conversion to be non-negligible, since AST uses a lot of small objects. Hopefully, no conversion is needed if no code transformer is registered.

Victor
On Feb 5, 2016, at 15:54, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-02-05 20:58 GMT+01:00 Andrew Barnert via Python-ideas <python-ideas@python.org>:
It would break the optimizer APIs, but `PyCode_Optimize` isn't public
Sadly, PyCode_Optimize() is public and part of the stable ABI.
Oops, you're right--I thought it was inside the Py_LIMITED_API check, but it's after the #endif. Anyway, it's not documented anywhere--no mention in the C API docs, and no explanation of what it does in the header (not even the usual one-liner comment). So, I think if you break that in PEP 511, hopefully nobody will complain :)
But I'm OK with taking the responsibility of breaking it if we implement PEP 511, since we will probably have a much better API to optimize AST and bytecode, and I would prefer that no one directly uses PyCode_Optimize()!
Great!
and while the API proposed by PEP 511 is public, that PEP isn't even finalized, much less accepted yet.
To be clear: PEP 511 is a draft still stuck in python-ideas; it's not ready for review on python-dev. I still expect changes to the API.
I'm very interested in your feedback on the bytecode transformer API :-)
As I mentioned before, I think it probably has to take and return a complete code object, not just the four pieces of the code object that the peephole optimizer happens to need. But meanwhile:
[^3]: The compiler doesn't actually have exactly what the optimizers would want, but it's pretty close: it has a linked list of block objects, each of which has an array of instruction objects, with jump targets being pointers to blocks. That's actually even better to work with, but too complicated to expose to optimizers.
The current implementation of the peephole optimizer has complex code to update jumps.
Yes, and it has to be complex, even despite punting on all the hard cases. :)
IMHO it would be simpler if the code were blocks of instructions.
Sure. But I didn't want to expose the compiler's internal blocks concept to the optimizer; that seems way too brittle. We could design and build a new optimizer-friendly concept of blocks that's independent of the compiler's (as it sounds like you've done in the past), or something simpler like MRAB's labels, or some other "macro-assembler"-style feature that nobody's yet thought of in this discussion.

If we really need to do that, I think it's seriously worth looking at incorporating byteplay, porting it to C, and providing a C API for it rather than designing something new. Besides the fact that it hews pretty closely to the design of the dis module, which is already in the stdlib, a variety of people have been using it for years, and I think experience shows that it makes transforming code objects, including inserting and moving ops, easy enough. It also takes care of other annoying things, like getting all 17 of the constructor arguments to code right.

But I think Serhiy's much simpler suggestion may be good enough: making NOP removal part of the repack process means that a simple optimizer (like the existing peephole optimizer) never has to renumber jumps or lnotab, and using a flat array of fixed-size instructions means that even if more complicated optimizers do have to renumber, it's easy instead of complex. I won't really know how his idea pans out until I build it and play with it, but I'm hopeful.

At any rate, whatever format we expose (unpacked bytecode, a macro-assembler-ish format, the internal compiler blocks, etc.) obviously determines the API for PEP 511 bytecode processors.
My plan is to implement most optimizations at the AST level and only optimize jumps at the "bytecode" level.
Sure, you've already explained (and Serhiy recently re-explained) that most of what the peephole optimizer does, and much of what you'd want it to do but it doesn't yet do, can be done at the AST level. But there will still be a few things that have to be done with bytecode. (Also, consider that things like decorators have to work on bytecode. And that, while your const optimization can be done at the AST level if we add a new AST node, that wouldn't help for anyone who had the same idea and just wanted to experiment with it, or deploy it locally, etc., without patching the compiler and the ast module.)
Do you know if byteplay and other projects have a similar design?
The design is pretty well documented. But it's basically a label-based design. Each original jump target is given a label, and you can add new labels anywhere (or give them nice names). At the end, when you convert a byteplay.Code object back to a code object, it maps those labels to offsets.
It would be nice to find the root common structure to be a generic API.
I don't know if there's anything common between labels, blocks, generalized trees, ... that's worth capturing here. We just want to find the simplest structure that works.
Flattening it would be trivial. Or, if that's too expensive, we could do something almost as simple and much cheaper: convert it in-line to a deque-like linked list of arrays, with jump targets being indices or pointers into that. Or we could just expose the list of blocks as-is, as an opaque thing with a mutable-deque-of-instructions API around it.
For the PEP 511 we can imagine multiple levels for code transformers: block of instructions, flattened bytecode, AST, etc.
I don't think anyone will ever want to write transformers at all possible levels. If you can access source bytes, source text, token stream, AST, and some friendlier version of bytecode, when would you want to access the ASDL, or the unfriendly bytecode, or any other stage in the process?
The cost of the transformation is converting from internal CPython objects to Python objects, and then converting the result of the code transformer back to internal CPython objects.
For bytecode, the compiler is already creating Python bytes and list objects to pass to the optimizer. (The AST is another story.)
2016-02-05 20:58 GMT+01:00 Andrew Barnert via Python-ideas <python-ideas@python.org>:
[^3]: The compiler doesn't actually have exactly what the optimizers would want, but it's pretty close: it has a linked list of block objects, each of which has an array of instruction objects, with jump targets being pointers to blocks.
This thread was hijacked by discussion of the bytecode byte format. I was confused by the discussion of extended arguments and the size of bytecode instructions in bytes.

Hopefully, the annoying case of "extended arguments" does not matter here! Compiler instructions use 32-bit signed integers (let's say "integers without arbitrary limit" :-D), and EXTENDED_ARG instructions are only emitted later, when the blocks of instructions are compiled to effective bytecode. That's another nice advantage of working on blocks of instructions :-)

Victor
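Concretely, that boundary could look like this sketch, assuming the 3.5-era encoding (little-endian 16-bit args, with EXTENDED_ARG carrying the high 16 bits of the compiler's 32-bit argument):

```python
import opcode

def emit(out, op, arg):
    """Append one instruction to `out` (a bytearray); `arg` is the full
    32-bit value the compiler carried around, and EXTENDED_ARG only
    appears here, at flattening time."""
    if arg > 0xFFFF:                 # one prefix suffices for a 32-bit arg
        high = arg >> 16
        out += bytes([opcode.EXTENDED_ARG, high & 0xFF, high >> 8])
        arg &= 0xFFFF
    out += bytes([op, arg & 0xFF, arg >> 8])
```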
On Feb 5, 2016, at 16:06, Victor Stinner <victor.stinner@gmail.com> wrote:
2016-02-05 20:58 GMT+01:00 Andrew Barnert via Python-ideas <python-ideas@python.org>:
[^3]: The compiler doesn't actually have exactly what the optimizers would want, but it's pretty close: it has a linked list of block objects, each of which has an array of instruction objects, with jump targets being pointers to blocks.
This thread was hijacked by discussion of the bytecode byte format. I was confused by the discussion of extended arguments and the size of bytecode instructions in bytes.
Hopefully, the annoying case of "extended arguments" does not matter here! Compiler instructions use 32-bit signed integers (let's say "integers without arbitrary limit" :-D), and EXTENDED_ARG instructions are only emitted later, when the blocks of instructions are compiled to effective bytecode.
That's exactly how it works today--except that the optimizer is on the wrong side of the boundary; it gets the emitted EXTENDED_ARG instructions, and has to do jump target and lnotab fixups that way. So, either we move the optimizer across that boundary (meaning we have to expose some of the fragile compiler internals), or we come up with a public representation that's easier to work on. I think Serhiy's "unpacked bytecode" may be a good enough version of the latter--and it's dead easy to build.
participants (6)

- Andrew Barnert
- Cesare Di Mauro
- MRAB
- Serhiy Storchaka
- Victor Stinner
- Yury Selivanov