Wordcode: new regular bytecode using 16-bit units

Hi, In the middle of recent discussions about Python performance, it was discussed to change the Python bytecode. Serhiy proposed to reuse MicroPython short bytecode to reduce the disk space and reduce the memory footprint. Demur Rumed proposes a different change to use a regular bytecode using 16-bit units: an instruction has always one 8-bit argument, it's zero if the instruction doesn't have an argument: http://bugs.python.org/issue26647 According to benchmarks, it looks faster: http://bugs.python.org/issue26647#msg263339 IMHO it's a nice enhancement: it makes the code simpler. The most interesting change is made in Python/ceval.c: - if (HAS_ARG(opcode)) - oparg = NEXTARG(); + oparg = NEXTARG(); This code is the very hot loop evaluating Python bytecode. I expect that removing a conditional branch here can reduce the CPU branch misprediction. I reviewed first versions of the change, and IMHO it's almost ready to be merged. But I would prefer to have a review from a least a second core reviewer. Can someone please review the change? -- The side effect of wordcode is that arguments in 0..255 now uses 2 bytes per instruction instead of 3, so it also reduce the size of bytecode for the most common case. Larger argument, 16-bit argument (0..65,535), now uses 4 bytes instead of 3. Arguments are supported up to 32-bit: 24-bit uses 3 units (6 bytes), 32-bit uses 4 units (8 bytes). MAKE_FUNCTION uses 16-bit argument for keyword defaults and 24-bit argument for annotations. Other common instruction known to use large argument are jumps for bytecode longer than 256 bytes. -- Right now, ceval.c still fetchs opcode and then oparg with two 8-bit instructions. Later, we can discuss if it would be possible to ensure that the bytecode is always aligned to 16-bit in memory to fetch the two bytes using a uint16_t* pointer. Maybe we can overallocate 1 byte in codeobject.c and align manually the memory block if needed. Or ceval.c should maybe copy the code if it's not aligned? Raymond Hettinger proposes something like that, but it looks like there are concerns about non-aligned memory accesses: http://bugs.python.org/issue25823 The cost of non-aligned memory accesses depends on the CPU architecture, but it can raise a SIGBUS on some arch (MIPS and SPARC?). Victor

Nice work. I think that for CPython, speed is much more important than memory use for the code. Disk space is practically free for anything smaller than a video. :-) On Wed, Apr 13, 2016 at 9:24 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On 13.04.16 19:33, Guido van Rossum wrote:
I collected statistics for use opcodes with different arguments during running CPython tests. Estimated size with using wordcode is 1.33 times less than with using current bytecode. [1] http://comments.gmane.org/gmane.comp.python.ideas/38293

I think the word code patch should go in sooner rather than later. Several of us have been through the patch and it is in pretty good shape (some parts still need work though). The earlier this goes in, the more time we'll have to shake out any unexpected secondary effects. perfect-is-the-enemy-of-good-ly yours, Raymond P.S. The patch is smaller, more tractable, and in better shape than the C version of OrderedDict was when it went in.

Hi Raymond, 2016-04-24 21:45 GMT+02:00 Raymond Hettinger <raymond.hettinger@gmail.com>:
I think the word code patch should go in sooner rather than later. Several of us have been through the patch and it is in pretty good shape (some parts still need work though). The earlier this goes in, the more time we'll have to shake out any unexpected secondary effects.
Yury Selivanov and Serhiy Storchaka told me that they will review shortly the patch. I give them one more week and then I will push the patch. I agree that the patch is in a good shape. I reviewed first versions of the change. I pushed some minor and obvious changes. I also asked to revert unrelated changes. I proposed to not try to optimize ceval.c to fetch (oparg, opval) in a single 16-bit operation. It should be easy to implement it later, but I prefer to focus on changing the format of the bytecode. Victor

Improving instruction decoding was the whole point and it was what kicked-off the work on the patch. It is also where most of the performance improvement comes from and isn't the difficult part of the patch. The persnickety parts of the patch lay elsewhere, so there is really nothing to be gained gutting out our actual objective. The OPs original patch had already gotten this part done and it ran fine for me. Raymond

Unless it is presenting a tough review challenge, we should do whatever we can to make it easier on the OP who seems to be working with very limited computational resources (I had to run the benchmarks for him because his setup lacked the requisite resources). He's already put a lot of work into the patch which is pretty good shape when it arrived. The opcode/oparg fetch logic is mostly already isolated to the part of the patch that touches ceval.c. I found that part to be relatively clean and clear. The part that took the most time to go through was for peephole.c. How about we let Yury and Serhiy take a pass at it as is. And, if they would benefit from splitting the patch into parts, then perhaps one of us with better tooling can pitch in to the help the OP. Raymond

On Wednesday, April 13, 2016 09:25, Victor Stinner wrote:
A couple months ago during an earlier discussion of wordcode, I got curious enough to instrument dis.dis so that I could calculate the actual size changes expected in practice. I ran it on a large chunk of our product code, here are the results (looks best with a fixed font). I suspect the fairly significant reduction in footprint will also give better cache hit characteristics, so we might see some "magic" speed ups from that, too. Code-generating source lines = 70,792 Total bytes = 1,196,653 Argument-bearing operators = 380,978 Operands over 1 byte long = 12,191 Extended arguments = 0 Percentage of 1-byte args = 96.80% Total operators = 434,697 Non-argument ops = 53,719 One-byte args = 368,787 Multi-byte args = 12,191 Byte code size = 1,196,653 Word code size = 893,776 Word:byte size = 74.69% Just for the record, here's my arithmetic: byteCodeSize = 1*nonArgumentOps + 3*oneByteArgs + 3*multiByteArgs wordCodeSize = 2*nonArgumentOps + 2*oneByteArgs + 4*multiByteArgs (It is interesting to note that I have never encountered an EXTENDED_ARG operator in the wild, only in my own synthetic examples.)

2016-04-13 23:02 GMT+02:00 Eric Fahlgren <ericfahlgren@gmail.com>:
Percentage of 1-byte args = 96.80%
Yeah, I expected such high ratio. Good news that you confirm it.
Again, only a very few arguments take multiple bytes. Good, the bytecode will be smaller. IMHO it's more a nice side effect than a real goal. The runtime performance matters more than the size of the bytecode, it's not like a bytecode take 4 MB. It's probably closer to 1 KB and so can probably benefit of the fatest CPU caches.
If multiByteArgs means any size > 1 byte, the wordCodeSize formula is wrong: - no parameter: 2 bytes - 8-bit parameter: 2 bytes - 16-bit parameter: 4 bytes - 24-bit parameter: 6 bytes - 32-bit parameter: 8 bytes But you wrote that you didn't see EXTEND_ARG, so I guess that multibyte means 16-bit in your case, and so your formula is correct. Hopefully, I don't expect 32-bit parameters in the wild, only 24-bit parameter for function with annotation.
(It is interesting to note that I have never encountered an EXTENDED_ARG operator in the wild, only in my own synthetic examples.)
As I wrote, EXTENDED_ARG can be seen when MAKE_FUNCTION is used with annotations. Victor

The EXTENDED_ARG is included in the multibyte ops, I treat it just like any other operator. Here's a snippet of my hacked-dis.dis output, which made it clear to me that I could just count them as an "operator with word operand." Line 3000: x = x if x or not x and x is None else x 0001dc83 7c 00 00 LOAD_FAST x 0001dc86 91 01 00 EXTENDED_ARG 1 0001dc89 70 9f dc JUMP_IF_TRUE_OR_POP L1dc9f 0001dc8c 7c 00 00 LOAD_FAST x 0001dc8f 0c UNARY_NOT 0001dc90 91 01 00 EXTENDED_ARG 1 0001dc93 6f 9f dc JUMP_IF_FALSE_OR_POPL1dc9f 0001dc96 7c 00 00 LOAD_FAST x 0001dc99 74 01 00 LOAD_GLOBAL None 0001dc9c 6b 08 00 COMPARE_OP 'is' L1dc9f: 0001dc9f 91 01 00 EXTENDED_ARG 1 0001dca2 72 ab dc POP_JUMP_IF_FALSE L1dcab 0001dca5 7c 00 00 LOAD_FAST x 0001dca8 6e 03 00 JUMP_FORWARD L1dcae (+3) L1dcab: 0001dcab 7c 00 00 LOAD_FAST x L1dcae: 0001dcae 7d 00 00 STORE_FAST x On Wed, Apr 13, 2016 at 2:23 PM, Victor Stinner <victor.stinner@gmail.com> wrote:

2016-04-13 23:23 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
Hopefully, I don't expect 32-bit parameters in the wild, only 24-bit parameter for function with annotation.
I never found 32-bit parameters, and not even 24-bit ones. I think that their usage is as rare as all planets alignment. ;-) That's why with in WPython I supported only 8, 16, and 32-bit parameters (which are 6 bytes long). Regards, Cesare

What is the value of HAS_ARG going to be now? -- Ryan [ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong. http://kirbyfan64.github.io/ On Apr 13, 2016 11:26 AM, "Victor Stinner" <victor.stinner@gmail.com> wrote:

Le mercredi 13 avril 2016, Ryan Gonzalez <rymg19@gmail.com> a écrit :
What is the value of HAS_ARG going to be now?
I asked Demur to keep HAS_ARG(). Not really for backward compatibility, but for the dis module: to keep a nice assembler. There are also debug traces in ceval.c which use it. For ceval.c, we might use HAS_ARG() to micro-optimize oparg=0 (hardcode 0 rather than reading the bytecode) for operators with no argument. Or maybe it's completly useless :-) Victor

So code that depends on iterating through bytecode via HAS_ARG is going to break... Darn it. :/ -- Ryan [ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong. http://kirbyfan64.github.io/ On Apr 13, 2016 4:44 PM, "Victor Stinner" <victor.stinner@gmail.com> wrote:

2016-04-14 0:11 GMT+02:00 Ryan Gonzalez <rymg19@gmail.com>:
So code that depends on iterating through bytecode via HAS_ARG is going to break...
Sure. This change is backward incompatible for applications parsing bytecode in C or Python. That's why the patch also has to update the dis module. I don't see how you plan to keep the backwad compatibility, since the argument size changed from 2 bytes to 1 byte. You must update your code (written in C or Python or whatever). Hopefully, the dis was enhanced in Python 3.4: get_instructions() now gives nice Instructon objects rather than only pure text output. FYI I wrote my own library to decode and decode bytecode. It provides abstract bytecode objects to easily modify bytecode: https://bytecode.readthedocs.org/ I suggest to use such library (or simply the dis module for simple needs) if you have to handle bytecode, rather than writing your own code. I know a few other projects which handle directly bytecode: * https://pypi.python.org/pypi/codetransformer * https://github.com/serprex/byteplay * https://pypi.python.org/pypi/coverage IHMO it's not a big deal to update these projects for the future Python 3.6. I can even help them to support the new bytecode format. Victor

On 14 April 2016 at 08:26, Victor Stinner <victor.stinner@gmail.com> wrote:
+1 We've also had previous discussions on adding a "minimum viable bytecode editing" API to the standard library, and updating these third party modules to support wordcode instead of bytecode could provide a good use-case-driven opportunity for defining that (i.e. it wouldn't be about providing an end user facing API directly, but rather about letting CPython take care of the bookkeeping details for things like lnotab and sorting out jump targets). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Le jeudi 14 avril 2016, Nick Coghlan <ncoghlan@gmail.com> a écrit :
Yeah, I know well this discussion since it started with my PEP 511. I wrote the bytecode as a tool for the discussion, to try to understand better the use case. The main task was to design the API. I first looked at byteplay and codetranformer projects, but I found some issues in their design. Their API has some design issues. IMHO their API is not the best to modify bytecode. My goal is to support Bytecode.from_code(code).to_code()==code: store enough information to be able to emit again exactly the same bytecode (line numbers, exact argument value, etc.). I started with a long email, but I decided to document differences in bytecode documentation: https://bytecode.readthedocs.org/en/latest/byteplay_codetransformer.html Victor

2016-04-13 18:24 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
Correct. The old bytecode format wasn't so much predictable for the CPU.
It should not be a problem, since every PyObject is allocated with PyAlloc (however I don't remember if it's the correct name) which AFAIK guarantees a base 8 bytes alignment. So, it's safe to use an unsigned int for keeping/referencing a word at the time. The only problem with such approach is related to the processor endianess, but it can be solved with proper macros (like I did with WPython). Regards, Cesare

Nice work. I think that for CPython, speed is much more important than memory use for the code. Disk space is practically free for anything smaller than a video. :-) On Wed, Apr 13, 2016 at 9:24 AM, Victor Stinner <victor.stinner@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On 13.04.16 19:33, Guido van Rossum wrote:
I collected statistics for use opcodes with different arguments during running CPython tests. Estimated size with using wordcode is 1.33 times less than with using current bytecode. [1] http://comments.gmane.org/gmane.comp.python.ideas/38293

I think the word code patch should go in sooner rather than later. Several of us have been through the patch and it is in pretty good shape (some parts still need work though). The earlier this goes in, the more time we'll have to shake out any unexpected secondary effects. perfect-is-the-enemy-of-good-ly yours, Raymond P.S. The patch is smaller, more tractable, and in better shape than the C version of OrderedDict was when it went in.

Hi Raymond, 2016-04-24 21:45 GMT+02:00 Raymond Hettinger <raymond.hettinger@gmail.com>:
I think the word code patch should go in sooner rather than later. Several of us have been through the patch and it is in pretty good shape (some parts still need work though). The earlier this goes in, the more time we'll have to shake out any unexpected secondary effects.
Yury Selivanov and Serhiy Storchaka told me that they will review shortly the patch. I give them one more week and then I will push the patch. I agree that the patch is in a good shape. I reviewed first versions of the change. I pushed some minor and obvious changes. I also asked to revert unrelated changes. I proposed to not try to optimize ceval.c to fetch (oparg, opval) in a single 16-bit operation. It should be easy to implement it later, but I prefer to focus on changing the format of the bytecode. Victor

Improving instruction decoding was the whole point and it was what kicked-off the work on the patch. It is also where most of the performance improvement comes from and isn't the difficult part of the patch. The persnickety parts of the patch lay elsewhere, so there is really nothing to be gained gutting out our actual objective. The OPs original patch had already gotten this part done and it ran fine for me. Raymond

Unless it is presenting a tough review challenge, we should do whatever we can to make it easier on the OP who seems to be working with very limited computational resources (I had to run the benchmarks for him because his setup lacked the requisite resources). He's already put a lot of work into the patch which is pretty good shape when it arrived. The opcode/oparg fetch logic is mostly already isolated to the part of the patch that touches ceval.c. I found that part to be relatively clean and clear. The part that took the most time to go through was for peephole.c. How about we let Yury and Serhiy take a pass at it as is. And, if they would benefit from splitting the patch into parts, then perhaps one of us with better tooling can pitch in to the help the OP. Raymond

On Wednesday, April 13, 2016 09:25, Victor Stinner wrote:
A couple months ago during an earlier discussion of wordcode, I got curious enough to instrument dis.dis so that I could calculate the actual size changes expected in practice. I ran it on a large chunk of our product code, here are the results (looks best with a fixed font). I suspect the fairly significant reduction in footprint will also give better cache hit characteristics, so we might see some "magic" speed ups from that, too. Code-generating source lines = 70,792 Total bytes = 1,196,653 Argument-bearing operators = 380,978 Operands over 1 byte long = 12,191 Extended arguments = 0 Percentage of 1-byte args = 96.80% Total operators = 434,697 Non-argument ops = 53,719 One-byte args = 368,787 Multi-byte args = 12,191 Byte code size = 1,196,653 Word code size = 893,776 Word:byte size = 74.69% Just for the record, here's my arithmetic: byteCodeSize = 1*nonArgumentOps + 3*oneByteArgs + 3*multiByteArgs wordCodeSize = 2*nonArgumentOps + 2*oneByteArgs + 4*multiByteArgs (It is interesting to note that I have never encountered an EXTENDED_ARG operator in the wild, only in my own synthetic examples.)

2016-04-13 23:02 GMT+02:00 Eric Fahlgren <ericfahlgren@gmail.com>:
Percentage of 1-byte args = 96.80%
Yeah, I expected such high ratio. Good news that you confirm it.
Again, only a very few arguments take multiple bytes. Good, the bytecode will be smaller. IMHO it's more a nice side effect than a real goal. The runtime performance matters more than the size of the bytecode, it's not like a bytecode take 4 MB. It's probably closer to 1 KB and so can probably benefit of the fatest CPU caches.
If multiByteArgs means any size > 1 byte, the wordCodeSize formula is wrong: - no parameter: 2 bytes - 8-bit parameter: 2 bytes - 16-bit parameter: 4 bytes - 24-bit parameter: 6 bytes - 32-bit parameter: 8 bytes But you wrote that you didn't see EXTEND_ARG, so I guess that multibyte means 16-bit in your case, and so your formula is correct. Hopefully, I don't expect 32-bit parameters in the wild, only 24-bit parameter for function with annotation.
(It is interesting to note that I have never encountered an EXTENDED_ARG operator in the wild, only in my own synthetic examples.)
As I wrote, EXTENDED_ARG can be seen when MAKE_FUNCTION is used with annotations. Victor

The EXTENDED_ARG is included in the multibyte ops, I treat it just like any other operator. Here's a snippet of my hacked-dis.dis output, which made it clear to me that I could just count them as an "operator with word operand." Line 3000: x = x if x or not x and x is None else x 0001dc83 7c 00 00 LOAD_FAST x 0001dc86 91 01 00 EXTENDED_ARG 1 0001dc89 70 9f dc JUMP_IF_TRUE_OR_POP L1dc9f 0001dc8c 7c 00 00 LOAD_FAST x 0001dc8f 0c UNARY_NOT 0001dc90 91 01 00 EXTENDED_ARG 1 0001dc93 6f 9f dc JUMP_IF_FALSE_OR_POPL1dc9f 0001dc96 7c 00 00 LOAD_FAST x 0001dc99 74 01 00 LOAD_GLOBAL None 0001dc9c 6b 08 00 COMPARE_OP 'is' L1dc9f: 0001dc9f 91 01 00 EXTENDED_ARG 1 0001dca2 72 ab dc POP_JUMP_IF_FALSE L1dcab 0001dca5 7c 00 00 LOAD_FAST x 0001dca8 6e 03 00 JUMP_FORWARD L1dcae (+3) L1dcab: 0001dcab 7c 00 00 LOAD_FAST x L1dcae: 0001dcae 7d 00 00 STORE_FAST x On Wed, Apr 13, 2016 at 2:23 PM, Victor Stinner <victor.stinner@gmail.com> wrote:

2016-04-13 23:23 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
Hopefully, I don't expect 32-bit parameters in the wild, only 24-bit parameter for function with annotation.
I never found 32-bit parameters, and not even 24-bit ones. I think that their usage is as rare as all planets alignment. ;-) That's why with in WPython I supported only 8, 16, and 32-bit parameters (which are 6 bytes long). Regards, Cesare

What is the value of HAS_ARG going to be now? -- Ryan [ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong. http://kirbyfan64.github.io/ On Apr 13, 2016 11:26 AM, "Victor Stinner" <victor.stinner@gmail.com> wrote:

Le mercredi 13 avril 2016, Ryan Gonzalez <rymg19@gmail.com> a écrit :
What is the value of HAS_ARG going to be now?
I asked Demur to keep HAS_ARG(). Not really for backward compatibility, but for the dis module: to keep a nice assembler. There are also debug traces in ceval.c which use it. For ceval.c, we might use HAS_ARG() to micro-optimize oparg=0 (hardcode 0 rather than reading the bytecode) for operators with no argument. Or maybe it's completly useless :-) Victor

So code that depends on iterating through bytecode via HAS_ARG is going to break... Darn it. :/ -- Ryan [ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong. http://kirbyfan64.github.io/ On Apr 13, 2016 4:44 PM, "Victor Stinner" <victor.stinner@gmail.com> wrote:

2016-04-14 0:11 GMT+02:00 Ryan Gonzalez <rymg19@gmail.com>:
So code that depends on iterating through bytecode via HAS_ARG is going to break...
Sure. This change is backward incompatible for applications parsing bytecode in C or Python. That's why the patch also has to update the dis module. I don't see how you plan to keep the backwad compatibility, since the argument size changed from 2 bytes to 1 byte. You must update your code (written in C or Python or whatever). Hopefully, the dis was enhanced in Python 3.4: get_instructions() now gives nice Instructon objects rather than only pure text output. FYI I wrote my own library to decode and decode bytecode. It provides abstract bytecode objects to easily modify bytecode: https://bytecode.readthedocs.org/ I suggest to use such library (or simply the dis module for simple needs) if you have to handle bytecode, rather than writing your own code. I know a few other projects which handle directly bytecode: * https://pypi.python.org/pypi/codetransformer * https://github.com/serprex/byteplay * https://pypi.python.org/pypi/coverage IHMO it's not a big deal to update these projects for the future Python 3.6. I can even help them to support the new bytecode format. Victor

On 14 April 2016 at 08:26, Victor Stinner <victor.stinner@gmail.com> wrote:
+1 We've also had previous discussions on adding a "minimum viable bytecode editing" API to the standard library, and updating these third party modules to support wordcode instead of bytecode could provide a good use-case-driven opportunity for defining that (i.e. it wouldn't be about providing an end user facing API directly, but rather about letting CPython take care of the bookkeeping details for things like lnotab and sorting out jump targets). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Le jeudi 14 avril 2016, Nick Coghlan <ncoghlan@gmail.com> a écrit :
Yeah, I know well this discussion since it started with my PEP 511. I wrote the bytecode as a tool for the discussion, to try to understand better the use case. The main task was to design the API. I first looked at byteplay and codetranformer projects, but I found some issues in their design. Their API has some design issues. IMHO their API is not the best to modify bytecode. My goal is to support Bytecode.from_code(code).to_code()==code: store enough information to be able to emit again exactly the same bytecode (line numbers, exact argument value, etc.). I started with a long email, but I decided to document differences in bytecode documentation: https://bytecode.readthedocs.org/en/latest/byteplay_codetransformer.html Victor

2016-04-13 18:24 GMT+02:00 Victor Stinner <victor.stinner@gmail.com>:
Correct. The old bytecode format wasn't so much predictable for the CPU.
It should not be a problem, since every PyObject is allocated with PyAlloc (however I don't remember if it's the correct name) which AFAIK guarantees a base 8 bytes alignment. So, it's safe to use an unsigned int for keeping/referencing a word at the time. The only problem with such approach is related to the processor endianess, but it can be solved with proper macros (like I did with WPython). Regards, Cesare
participants (9)
-
Cesare Di Mauro
-
Eric Fahlgren
-
Guido van Rossum
-
Nick Coghlan
-
Raymond Hettinger
-
Ryan Gonzalez
-
Serhiy Storchaka
-
Victor Stinner
-
Yury Selivanov