Damien George mentioned [1] that in MicroPython there are 16 dedicated opcodes for loading from positions 0-15, and 16 for storing to these positions. This not only makes a bytecode more compact, but might make CPython slightly faster too.

I have collected statistics for opcodes generated in CPython while running the compileall module on the Lib/ directory. The raw data file is attached. Here are the top 16 most used opcodes. Column 4 is the percentage among all opcodes. Columns 5-9 are the percentages of those opcodes whose argument exceeds the specified width.

     #   op  name                  %   >2bit  >3bit  >4bit  >8bit  >16bit
     1. 100  LOAD_CONST         22.2    57.8   39.7   25.8    3.6    0
     2. 124  LOAD_FAST          14.7    19.3    5.95   1.04   0      0
     3. 106  LOAD_ATTR          10.6    49.5   23.1    6.61   0      0
     4. 131  CALL_FUNCTION      10.3     8.54   5.68   5.6    5.6    0
     5. 116  LOAD_GLOBAL         5.8    39.2   16.7    4.21   0      0
     6.   1  POP_TOP             5.11
     7.  90  STORE_NAME          4.04   69.5   50.4   33.1    4.03   0
     8. 125  STORE_FAST          3.8    39.4   12.8    2.24   0      0
     9.  83  RETURN_VALUE        2.68
    10. 132  MAKE_FUNCTION       2.16    1.33   0.861  0.782  0.776  0.23
    11. 101  LOAD_NAME           1.81   64     49.9   35.3    2.35   0
    12. 114  POP_JUMP_IF_FALSE   1.39  100    100     99.5   27.4    0
    13. 102  BUILD_TUPLE         1.28   12.4    2.54   0.211  0      0
    14. 107  COMPARE_OP          1.15   56.6   36.3    0      0      0
    15.  87  POP_BLOCK           0.89
    16. 110  JUMP_FORWARD        0.734  75.1   69.6   58.2    2.36   0

13.7% of opcodes don't have an argument. No argument exceeds 24 bits. Only 0.005% of opcodes have an argument exceeding 16 bits and requiring EXTENDED_ARG (these opcodes are MAKE_FUNCTION and MAKE_CLOSURE). Only 2% of opcodes have an argument exceeding 8 bits; these are mostly jumping opcodes.

More than every 5th opcode is LOAD_CONST. Only a few files, like Lib/html/__init__.py or Lib/locale.py, contain a LOAD_CONST with an argument larger than 1024; they contain a large number of constants. Arguments larger than 256 are used mostly in encoding files, for constructing mappings that are never used.

The mean size of an opcode is 2.73 bytes.

Here is the top 32 of most used opcodes with arguments:

     #   op  name           arg     %   cum%
     1. 124  LOAD_FAST        0   5.76   5.76
     2. 131  CALL_FUNCTION    1   4.27  10.03
     3. 124  LOAD_FAST        1   3.10  13.13
     4. 131  CALL_FUNCTION    2   2.87  16.00
     5. 100  LOAD_CONST       1   2.81  18.80
     6. 100  LOAD_CONST       0   2.67  21.47
     7. 100  LOAD_CONST       2   2.20  23.67
     8. 132  MAKE_FUNCTION    0   1.98  25.64
     9. 124  LOAD_FAST        2   1.86  27.51
    10. 100  LOAD_CONST       3   1.68  29.19
    11. 131  CALL_FUNCTION    0   1.54  30.72
    12. 106  LOAD_ATTR        0   1.44  32.16
    13. 106  LOAD_ATTR        1   1.39  33.55
    14. 106  LOAD_ATTR        2   1.36  34.91
    15. 116  LOAD_GLOBAL      0   1.34  36.24
    16. 100  LOAD_CONST       4   1.31  37.56
    17. 124  LOAD_FAST        3   1.18  38.73
    18. 106  LOAD_ATTR        3   1.17  39.91
    19. 100  LOAD_CONST       5   1.09  41.00
    20. 125  STORE_FAST       1   0.95  41.94
    21. 106  LOAD_ATTR        4   0.94  42.88
    22. 116  LOAD_GLOBAL      1   0.93  43.81
    23. 100  LOAD_CONST       6   0.89  44.69
    24. 124  LOAD_FAST        4   0.78  45.48
    25. 131  CALL_FUNCTION    3   0.76  46.23
    26. 106  LOAD_ATTR        5   0.74  46.98
    27. 125  STORE_FAST       2   0.74  47.72
    28. 100  LOAD_CONST       7   0.73  48.45
    29. 116  LOAD_GLOBAL      2   0.73  49.18
    30. 102  BUILD_TUPLE      2   0.68  49.86
    31. 106  LOAD_ATTR        6   0.62  50.47
    32. 100  LOAD_CONST       8   0.61  51.08

There are two ways to make Python opcodes more compact.

1. Add 1-byte dedicated opcodes for most used opcodes with arguments. If we add 32 short opcodes for the 6 most popular opcodes (LOAD_CONST and LOAD_ATTR with arguments from 0 to 7; LOAD_FAST, STORE_FAST, LOAD_GLOBAL and CALL_FUNCTION with arguments from 0 to 3), this will decrease the mean size of an opcode from 2.73 to 1.75 bytes (1.56 times). This might make decoding of 49% of opcodes slightly faster, since their argument no longer needs to be decoded. This is a relatively simple and mostly compatible way, but not fully compatible.
We would either have to shift the numbers of opcodes with arguments to free space for the additional opcodes, or add a new range for argumentless opcodes at the end of the 8-bit space. Third-party tools that work with CPython bytecode would need to be updated.

2. Use 16-bit opcodes as in WPython. This will decrease the mean size of an opcode from 2.73 to 2.044 bytes (1.33 times). But this might make decoding of 85% of opcodes slightly faster. Raymond showed a roughly 10+% speedup in code exercising simple and common opcodes that have an oparg, just by optimizing the oparg decoding. [2] Using 16-bit opcodes with an embedded oparg might have an even larger effect. This change needs a more serious reworking of the compiler and interpreter, not only of the disassembler and third-party tools.

[1] http://permalink.gmane.org/gmane.comp.python.devel/156162
[2] http://bugs.python.org/issue25823
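As a rough illustration of option 1 from the compiler side (a sketch only: the LOAD_FAST_SHORT0 opcode number and the rewrite pass are invented, not taken from any patch), a peephole-style pass could turn a classic 3-byte LOAD_FAST with a small argument into a dedicated 1-byte form:

    #include <stddef.h>

    #define HAVE_ARGUMENT     90    /* real CPython value */
    #define NOP                9    /* real CPython value */
    #define LOAD_FAST        124    /* real CPython value */
    #define LOAD_FAST_SHORT0 235    /* invented; 235..238 assumed free */

    /* Rewrite "LOAD_FAST n" (1 opcode byte + 2 little-endian argument
       bytes in the classic format) into a hypothetical 1-byte
       LOAD_FAST_SHORT0+n opcode when n < 4.  The freed bytes become
       NOPs; a real pass would drop them and fix up jump targets, as the
       peephole optimizer already does.  Assumes well-formed bytecode. */
    static void shorten_load_fast(unsigned char *code, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char op = code[i];
            if (op < HAVE_ARGUMENT) {
                i += 1;
                continue;
            }
            unsigned int arg = code[i + 1] | (code[i + 2] << 8);
            if (op == LOAD_FAST && arg < 4) {
                code[i] = (unsigned char)(LOAD_FAST_SHORT0 + arg);
                code[i + 1] = NOP;
                code[i + 2] = NOP;
            }
            i += 3;
        }
    }

Each of the 32 short forms proposed above costs one opcode number, which is why the message talks about either shifting existing opcode numbers or opening a new argumentless range at the end of the 8-bit space.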
2016-02-04 14:03 GMT+01:00 Serhiy Storchaka <storchaka@gmail.com>:
Damien George mentioned [1] that in MicroPython there are 16 dedicated opcodes for loading from positions 0-15, and 16 for storing to these positions. This not only makes a bytecode more compact, but might make CPython slightly faster too.
MicroPython has very important requirements on disk and memory footprint. I don't think that's the case for CPython.
1. Add 1-byte dedicated opcodes for most used opcodes with arguments.
It means duplicating a lot of code in ceval.c, no? (Even if we use C #define "templates"). I dislike this option.
2. Use 16-bit opcodes as in WPython.
IMHO this is a much more interesting option. It would be great to remove "if (HAS_ARG(opcode)) oparg = NEXTARG();" and always get the argument from the 16-bit opcode. This if() adds more work to the CPU, which has to flush the pipeline on a branch misprediction. Usually it doesn't matter, but this if() is part of the hot path of Python: it's in the most important loop, the one running the bytecode. So any instruction matters here ;-) You just have to handle EXTENDED_ARG correctly for large arguments. Victor
On 04.02.16 18:16, Victor Stinner wrote:
1. Add 1-byte dedicated opcodes for most used opcodes with arguments.
It means duplicating a lot of code in ceval.c, no? (Even if we use C #define "templates"). I dislike this option.
This needs adding less than 40 lines of code to ceval.c: a 5-line macro and 1 line per new opcode. A few more lines need to be added to the compiler, the peephole optimizer, opcode.py, opcode.h and the documentation.
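For illustration only (a guess at the shape of that macro, with invented opcode numbers; not code from an actual patch), the dispatch could look roughly like this, with each dedicated opcode setting oparg and reusing the existing opcode body:

    /* Sketch of a ceval.c-style dispatch loop with dedicated short
       opcodes: SHORT_ARG_OPCODE is the 5-line part, and each new opcode
       then costs a single case line.  Opcode numbers 235..238 are
       assumed free; the LOAD_FAST body is a stand-in for the real one. */
    #include <stddef.h>

    enum {
        HAVE_ARGUMENT = 90, LOAD_FAST = 124,
        LOAD_FAST_0 = 235, LOAD_FAST_1, LOAD_FAST_2, LOAD_FAST_3
    };

    #define SHORT_ARG_OPCODE(op, n) \
        case op##_##n:              \
            oparg = n;              \
            goto target_##op;

    static long run(const unsigned char *code, size_t len, const long *fast)
    {
        long acc = 0;
        size_t i = 0;
        while (i < len) {
            unsigned char opcode = code[i++];
            int oparg = 0;
            /* the short opcodes live in a new argumentless range, so the
               HAS_ARG-style test must exclude them */
            if (opcode >= HAVE_ARGUMENT && opcode < LOAD_FAST_0) {
                oparg = code[i] | (code[i + 1] << 8);
                i += 2;
            }
            switch (opcode) {
            SHORT_ARG_OPCODE(LOAD_FAST, 0)
            SHORT_ARG_OPCODE(LOAD_FAST, 1)
            SHORT_ARG_OPCODE(LOAD_FAST, 2)
            SHORT_ARG_OPCODE(LOAD_FAST, 3)
            case LOAD_FAST:
            target_LOAD_FAST:
                acc += fast[oparg];     /* stand-in for the real body */
                break;
            default:
                break;
            }
        }
        return acc;
    }

In the real ceval.c the opcode bodies already exist, so the added code is essentially the macro plus one case line per new opcode, roughly the line count estimated above.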
2. Use 16-bit opcodes as in WPython.
IMHO this is a much more interesting option.
It would be great to remove "if (HAS_ARG(opcode)) oparg = NEXTARG();" and always get the argument from the 16-bit opcode. This if() adds more work to the CPU, which has to flush the pipeline on a branch misprediction. Usually it doesn't matter, but this if() is part of the hot path of Python: it's in the most important loop, the one running the bytecode. So any instruction matters here ;-)
Actually, in the common case this "if" is executed at compile time. But I expect a benefit from using a simple read of a 16-bit value instead of 3 reads of 8-bit values.
On 04.02.2016 17:55, Serhiy Storchaka wrote:
On 04.02.16 18:16, Victor Stinner wrote:
1. Add 1-byte dedicated opcodes for most used opcodes with arguments.
It means duplicating a lot of code in ceval.c, no? (Even if we use C #define "templates"). I dislike this option.
This needs adding less than 40 lines of code to ceval.c: a 5-line macro and 1 line per new opcode. A few more lines need to be added to the compiler, the peephole optimizer, opcode.py, opcode.h and the documentation.
Do you have a working CPython to showcase this?
2. Use 16-bit opcodes as in WPython.
IMHO this is a much more interesting option.
It would be great to remove "if (HAS_ARG(opcode)) oparg = NEXTARG();" and always get the argument from the 16-bit opcode. This if() adds more work to the CPU, which has to flush the pipeline on a branch misprediction. Usually it doesn't matter, but this if() is part of the hot path of Python: it's in the most important loop, the one running the bytecode. So any instruction matters here ;-)
Actually, in the common case this "if" is executed at compile time. But I expect a benefit from using a simple read of a 16-bit value instead of 3 reads of 8-bit values.
Same question as above + does it make sense to combine those two options? Best, Sven
On 04.02.16 19:27, Sven R. Kunze wrote:
On 04.02.2016 17:55, Serhiy Storchaka wrote:
1. Add 1-byte dedicated opcodes for most used opcodes with arguments.
Do you have a working CPython to showcase this?
Not yet. The interpreter is easy, but the compiler is more complicated in this case.
2. Use 16-bit opcodes as in WPython.
Same question as above + does it make sense to combine those two options?
Of course not. Having some opcodes be 8-bit kills the benefit of all 16-bit opcodes. This option needs rewriting more code, but the resulting code can be simpler. I'm afraid there might be a problem with bootstrapping, generating frozen modules.
On 04.02.2016 19:55, Serhiy Storchaka wrote:
On 04.02.16 19:27, Sven R. Kunze wrote:
On 04.02.2016 17:55, Serhiy Storchaka wrote:
1. Add 1-byte dedicated opcodes for most used opcodes with arguments.
Do you have a working CPython to showcase this?
Not yet. The interpreter is easy, but the compiler is more complicated in this case.
2. Use 16-bit opcodes as in WPython.
Same question as above + does it make sense to combine those two options?
Of course not. Having some opcodes be 8-bit kills the benefit of all 16-bit opcodes.
Are you sure? I still could imagine some benefit.
This option needs rewriting more code, but the resulting code can be simpler. I'm afraid there might be a problem with bootstrapping, generating frozen modules.
What option do you prefer then?
On Thursday, February 4, 2016 12:13 PM, Sven R. Kunze <srkunze@mail.de> wrote:
On 04.02.2016 19:55, Serhiy Storchaka wrote:
On 04.02.16 19:27, Sven R. Kunze wrote:
On 04.02.2016 17:55, Serhiy Storchaka wrote:
1. Add 1-byte dedicated opcodes for most used opcodes with arguments.
Do you have a working CPython to showcase this?
Not yet. The interpreter is easy, but the compiler is more complicated in this case.
2. Use 16-bit opcodes as in WPython.
Same question as above + does it make sense to combine those two options?
Of course not. Having some opcodes be 8-bit kills the benefit of all 16-bit opcodes.
Are you sure? I still could imagine some benefit.
The only benefits I can see for using 16-bit opcodes are:

* You can treat the bytecode as an (unsigned short *) instead of (unsigned char *) and do 16-bit reads everywhere instead of 8-bit reads. This will no longer be true if you have to do an 8-bit read and then conditionally read more bits.

* The ceval loop can be simplified if it doesn't have to deal with variable lengths. This will no longer be true if you have to deal with variable lengths.

So, unless you're imagining something else, I don't think there is any benefit to be gotten by directly combining the two.

You may be able to import some of the "spirit" of the first one into the second, but I'm not sure it's worth it. One of the minor problems with wordcode is that args in range [256, 65536) now take 4 bytes instead of 3. If you cram a few bits into the opcode byte, you can push the range farther back. I'm guessing it's pretty rare to do LOAD_CONST 256, but giving, say, 5 bits to JUMP_ABSOLUTE, and 2 bits to each of the various relative jumps, would cut down the need for EXTENDED_ARGS dramatically.
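As an illustration of that idea (a sketch only; the 0xC0 opcode block and the decode helper are invented, and the word layout with the opcode in the low byte follows the wordcode scheme discussed in this thread):

    #include <stdint.h>

    #define JUMP_ABSOLUTE_BASE 0xC0   /* invented block 0xC0..0xDF */

    /* Decode a hypothetical "wide" JUMP_ABSOLUTE that borrows 5 bits
       from the opcode byte: the opcode occupies 32 consecutive values,
       and those 5 bits become the high bits of a 13-bit argument. */
    static int decode_jump_absolute(uint16_t word, unsigned int *target)
    {
        unsigned int opcode = word & 0xff;
        if (opcode >= JUMP_ABSOLUTE_BASE && opcode < JUMP_ABSOLUTE_BASE + 32) {
            *target = ((opcode - JUMP_ABSOLUTE_BASE) << 8) | (word >> 8);
            return 1;
        }
        return 0;
    }

With 13 bits, absolute jump arguments up to 8191 fit in a single word, so EXTENDED_ARG would only be needed for unusually large functions.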
On Fri, Feb 5, 2016 at 9:22 AM, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
You may be able to import some of the "spirit" of the first one into the second, but I'm not sure it's worth it. One of the minor problems with wordcode is that args in range [256, 65536) now take 4 bytes instead of 3. If you cram a few bits into the opcode byte, you can push the range farther back. I'm guessing it's pretty rare to do LOAD_CONST 256, but giving, say, 5 bits to JUMP_ABSOLUTE, and 2 bits to each of the various relative jumps, would cut down the need for EXTENDED_ARGS dramatically.
At the cost of complexity, possibly including less readable code. This change can always be done later, if and only if it proves to make enough difference in performance. ChrisA
On Thursday, February 4, 2016 6:57 PM, Chris Angelico <rosuav@gmail.com> wrote:
On Fri, Feb 5, 2016 at 9:22 AM, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
You may be able to import some of the "spirit" of the first one into the second, but I'm not sure it's worth it. One of the minor problems with wordcode is that args in range [256, 65536) now take 4 bytes instead of 3. If you cram a few bits into the opcode byte, you can push the range farther back. I'm guessing it's pretty rare to do LOAD_CONST 256, but giving, say, 5 bits to JUMP_ABSOLUTE, and 2 bits to each of the various relative jumps, would cut down the need for EXTENDED_ARGS dramatically.
At the cost of complexity, possibly including less readable code. This change can always be done later, if and only if it proves to make enough difference in performance.
Definitely. And, of course, the extra complexity may actually slow things down. I'm just saying it _might_ be worth prototyping and testing if the wordcode experiment turns out to be worth pursuing[^1], and someone produces profiles showing that jumps are now a bigger bottleneck than they used to be.

[^1]: We already knew that it was the starting point of a 2.6 fork that claimed to be faster than stock CPython, but that fork had lots of other optimizations, and I don't know of any serious benchmarking or profiling results from it in the first place. We now also know that changing to wordcode is probably easier than it looks. That may add up to "worth a few more hours to build a complete prototype", but it doesn't add up to "do it and merge it and start planning further changes around it" just yet. :)
On Fri, Feb 5, 2016 at 3:25 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
That may add up to "worth a few more hours to build a complete prototype", but it doesn't add up to "do it and merge it and start planning further changes around it" just yet. :)
Yep! But also, I'm inclined to reduce, rather than expand, the scope of the change - it's a few hours followed by a few more to do the jumps-with-some-bits-in-opcode, etc, etc. ChrisA
On Thursday, February 4, 2016 8:55 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 04.02.16 18:16, Victor Stinner wrote:
2. Use 16-bit opcodes as in WPython.
IMHO this is a much more interesting option.
It would be great to remove "if (HAS_ARG(opcode)) oparg = NEXTARG();" and always get the argument from the 16-bit opcode. This if() adds more work to the CPU, which has to flush the pipeline on a branch misprediction. Usually it doesn't matter, but this if() is part of the hot path of Python: it's in the most important loop, the one running the bytecode. So any instruction matters here ;-)
Actually, in the common case this "if" is executed at compile time. But I expect a benefit from using a simple read of a 16-bit value instead of 3 reads of 8-bit values.
Only if the 16-bit reads are aligned. Can that be guaranteed somehow?

If you change co_code to some type that's like bytes, but an immutable array of shorts instead of bytes, that would certainly do it. And it would be easier to inspect/hack 16-bit bytecodes from Python if you didn't have to look at them as bytes. But that implies creating the new type, exposing it to Python, and changing what co_code returns.

Or it could just be a simple switch when constructing a code object: if the code argument is aligned, use it as-is; if not, copy its contents into an aligned array, build a new bytes around that, and store that instead. (Would that significantly slow down .pyc reads and other marshaling uses, if half of all code objects have to copy their contents?)
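A minimal sketch of that switch, assuming the code object keeps a bytes object for co_code (the PyBytes_* calls are the existing C API; the helper itself is invented for illustration):

    #include <Python.h>
    #include <stdint.h>

    /* Return a bytes object whose buffer is safe for 16-bit reads:
       reuse the argument if its data is already 2-byte aligned,
       otherwise return a copy (a freshly allocated bytes object's
       buffer is suitably aligned). */
    static PyObject *ensure_aligned_code(PyObject *code_bytes)
    {
        const char *data = PyBytes_AS_STRING(code_bytes);
        if (((uintptr_t)data & 1) == 0) {
            Py_INCREF(code_bytes);
            return code_bytes;          /* already aligned: no copy */
        }
        return PyBytes_FromStringAndSize(data, PyBytes_GET_SIZE(code_bytes));
    }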
On 2016-02-04 17:28:45, "Andrew Barnert via Python-ideas" <python-ideas@python.org> wrote:
On Thursday, February 4, 2016 8:55 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 04.02.16 18:16, Victor Stinner wrote:
2. Use 16-bit opcodes as in WPython.
IMHO this is a much more interesting option.
It would be great to remove "if (HAS_ARG(opcode)) oparg = NEXTARG();" and always get the argument from the 16-bit opcode. This if() adds more work to the CPU, which has to flush the pipeline on a branch misprediction. Usually it doesn't matter, but this if() is part of the hot path of Python: it's in the most important loop, the one running the bytecode. So any instruction matters here ;-)
Actually, in the common case this "if" is executed at compile time. But I expect a benefit from using a simple read of a 16-bit value instead of 3 reads of 8-bit values.
Only if the 16-bit reads are aligned. Can that be guaranteed somehow?
Aren't allocated blocks usually aligned to n-byte boundaries (or multiples thereof) anyway, where n is the number of bytes needed to hold an address?
If you change co_code to some type that's like bytes, but an immutable array of shorts instead of bytes, that would certainly do it. And it would be easier to inspect/hack 16-bit bytecodes from Python if you didn't have to look at them as bytes. But that implies creating the new type, exposing it to Python, and changing what co_code returns.
Or it could just be a simple switch when constructing a code object: if the code argument is aligned, use it as-is; if not, copy its contents into an aligned array, build a new bytes around that, and store that instead. (Would that significantly slow down .pyc reads and other marshaling uses, if half of all code objects have to copy their contents?)
On 04.02.16 19:28, Andrew Barnert via Python-ideas wrote:
Actually, in the common case this "if" is executed at compile time. But I expect a benefit from using a simple read of a 16-bit value instead of 3 reads of 8-bit values.
Only if the 16-bit reads are aligned. Can that be guaranteed somehow?
Yes, we can. On common platforms the start of bytes data is at least 32-bit aligned. Other code depends on this. It can be unaligned only on platforms where the alignment doesn't matter.
Serhiy, Victor, Andrew,

What do you think about turning constant slice objects into actual constants?

Right now:

    def a(): return a[:1]

    dis.dis(a)
      1       0 LOAD_GLOBAL       0 (a)
              3 LOAD_CONST        0 (None)
              6 LOAD_CONST        1 (1)
              9 BUILD_SLICE       2
             12 BINARY_SUBSCR
             13 RETURN_VALUE

With this optimization we could just have:

      1       0 LOAD_GLOBAL       0 (a)
              3 LOAD_CONST        0 (:1)
              6 BINARY_SUBSCR
              7 RETURN_VALUE

Yury
2016-02-05 23:46 GMT+01:00 Yury Selivanov <yselivanov.ml@gmail.com>:
What do you think about turning constant slice objects into actual constants?
It doesn't look interesting.

    bench$ python3 -m timeit -s 'def f(list):' -s ' list[:1]; list[:1]; list[:1]; list[:1]; list[:1]' -s 'l=list("hello")' 'f(l)'
    1000000 loops, best of 3: 1.34 usec per loop

    bench$ python3 -m timeit -s 'def f(list):' -s ' list["whatever"]; list["whatever"]; list["whatever"]; list["whatever"]; list["whatever"]' -s 'import types; co=f.__code__; consts=[slice(0, 1, None) if const == "whatever" else const for const in co.co_consts]; f.__code__ = types.CodeType(co.co_argcount, co.co_kwonlyargcount, co.co_nlocals, co.co_stacksize, co.co_flags, co.co_code, tuple(consts), co.co_names, co.co_varnames, co.co_filename, co.co_name, co.co_firstlineno, co.co_lnotab, co.co_freevars, co.co_cellvars); l=list("hello")' 'f(l)'
    1000000 loops, best of 3: 1.3 usec per loop

It's only 3% faster on a microbenchmark. (On my laptop: 687 ns => 658 ns: 4% faster.) I'm not sure that the benchmark is "fair" since the slice object has a free list of 1 item which is probably always used in this benchmark.

But you will be able to easily implement such a "hack" in fatoptimizer, which already has similar optimizations. IMHO it's ok for an optimizer like fatoptimizer; it's not worth it for the default builtin optimizer in CPython.

Victor
On 2016-02-05 7:49 PM, Serhiy Storchaka wrote:
On 06.02.16 00:46, Yury Selivanov wrote:
What do you think about turning constant slice objects into actual constants?
Slices are not supported by marshal.
That should be fixable, but in any case -- this is a nano-optimization, which is nice to have when there is nothing left to optimize ;) Yury
2016-02-06 1:49 GMT+01:00 Serhiy Storchaka <storchaka@gmail.com>:
Slices are not supported by marshal.
It's possible to inject arbitrary objects into code constants at runtime if you replace the __code__ attribute of a function. I'm using that in FAT Python to inject builtin functions into constants. Later, if the optimization has been proven to be efficient enough, we may consider implementing it directly in Python ;-) Victor
Serhiy Storchaka <storchaka@...> writes:
2. Use 16-bit opcodes as in WPython.
This will decrease the mean size of an opcode from 2.73 to 2.044 bytes (1.33 times). But this might make decoding of 85% of opcodes slightly faster.
It sounds like, by 16-bit opcodes, you mean combine the opcode and the argument in a single 16-bit word. But that doesn't solve the issue you want to solve: you still have to decode the argument encoded in the 16-bit word. I don't see where the benefit is.

The *byte* size of bytecode is IMO largely unimportant. Python bytecode is synthetic and high-level, its expressive density is much higher than that of native code. I don't think the cache occupancy of bytecode is a significant contributor to Python performance (certainly much less so than the memory consumption of *objects*).

(as for the disk occupancy of pyc files, if that is a concern we could actually compress those files; there are very fast compressors with decent efficiency these days, such as lzo or snappy)

It is generally estimated the overhead of bytecode dispatch and decoding is around 10-30% for CPython. You cannot hope to eliminate that overhead entirely without writing a (JIT or AOT) compiler, so any heroic effort to restructure the current opcode space and structure will at best win 5 to 20% on select benchmarks.

Regards

Antoine.
On Feb 6, 2016, at 11:18, Antoine Pitrou <antoine@python.org> wrote:
Serhiy Storchaka <storchaka@...> writes:
2. Use 16-bit opcodes as in WPython.
This will decrease the mean size of an opcode from 2.73 to 2.044 bytes (1.33 times). But this might make decoding of 85% of opcodes slightly faster.
It sounds like, by 16-bit opcodes, you mean combine the opcode and the argument in a single 16-bit word. But that doesn't solve the issue you want to solve: you still have to decode the argument encoded in the 16-bit word. I don't see where the benefit is.
If (at least most of the time) the argument is just the second byte of the two bytes, that decoding is trivial. The cost should be less than the savings of doing one short read instead of three byte reads. (Of course we won't know that for sure until we try.) It should also let us simplify the eval loop a bit, and simplify all other code that has to deal with bytecode (although I'd like to simplify it even further, as discussed in the other thread).
It is generally estimated the overhead of bytecode dispatch and decoding is around 10-30% for CPython. You cannot hope to eliminate that overhead entirely without writing a (JIT or AOT) compiler, so any heroic effort to restructure the current opcode space and structure will at best win 5 to 20% on select benchmarks.
To be honest, unlike everyone else on this thread, I'm actually more interested in the simplicity gains than the performance gains. When I'm running CPU-bound code that spends most of its time in CPython itself, it's usually more than fast enough--and, when it isn't, it's almost always an order of magnitude too slow, so adding up all these little 5-10% gains is never going to get us near the point where I can stop using PyPy or Cython or something.

I don't know whether other people have different use cases where these speedups really do matter, or if they're just chasing micro-optimization of the CPython core for its own sake, but I'm willing to cynically use their desire for an 8% speedup to get them onboard with selling my simplification. :)

(Plus, adding up all these little gains could soon get us to the point where 3.7 finally beats 2.7 in every benchmark, instead of just most of them, which would kill off an annoying source of FUD.)
On Sat, 6 Feb 2016 16:21:11 -0800 Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
To be honest, unlike everyone else on this thread, I'm actually more interested in the simplicity gains than the performance gains.
It is a laudable goal, but what is proposed here wouldn't simplify much if anything, since some opcodes need more than 8 bits of arguments (typically the opcodes that have two logical 8-bit arguments packed in the 16-bit word). So you'll still get variable-sized opcodes and need adequate decoding machinery for them.

So perhaps bytecode should be processed by the compile chain in the form of:

    typedef struct {
        int opcode;
        int operand;
    } instruction_t;

    instruction_t *bytecode;

And then get packed at the end, when creating the code object.
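A hedged sketch of what that packing step could look like if the target is 16-bit words (illustrative only; the EXTENDED_ARG layout follows the wordcode scheme discussed in this thread, and the jump-target fixups a real implementation needs are omitted):

    #include <stdint.h>
    #include <stddef.h>

    #define EXTENDED_ARG 144   /* CPython 3.x opcode number */

    typedef struct { int opcode; int operand; } instruction_t;

    /* Pack instructions into 16-bit words (opcode in the low byte,
       8 bits of argument in the high byte), emitting EXTENDED_ARG
       prefixes for wider operands.  Returns the number of words
       written; the caller must leave room for up to 4 words per
       instruction. */
    static size_t pack(const instruction_t *in, size_t n, uint16_t *out)
    {
        size_t w = 0;
        for (size_t i = 0; i < n; i++) {
            uint32_t arg = (uint32_t)in[i].operand;
            if (arg > 0xffffff)
                out[w++] = (uint16_t)(EXTENDED_ARG | (((arg >> 24) & 0xff) << 8));
            if (arg > 0xffff)
                out[w++] = (uint16_t)(EXTENDED_ARG | (((arg >> 16) & 0xff) << 8));
            if (arg > 0xff)
                out[w++] = (uint16_t)(EXTENDED_ARG | (((arg >> 8) & 0xff) << 8));
            out[w++] = (uint16_t)(in[i].opcode | ((arg & 0xff) << 8));
        }
        return w;
    }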
When I'm running CPU-bound code that spends most of its time in CPython itself, it's usually more than fast enough--and, when it isn't, it's almost always an order of magnitude too slow, so adding up all these little 5-10% gains is never going to get us near the point where I can stop using PyPy or Cython or something.
I think that mirrors the experience of many people. Small perf improvements are not useless either, because they can make e.g. command-line tools feel a bit more "snappy".
(Plus, adding up all these little gains could soon get us to the point where 3.7 finally beats 2.7 in every benchmark, instead of just most of them, which would kill off an annoying source of FUD.)
I think that FUD is very tired now and very few people pay attention to it. The only major factor slowing down the 2->3 migration is the effort required for porting. (also, the breadth of new features and improvements in the 3.x line massively offsets small degradations on micro-benchmarks) Regards Antoine.
On Feb 7, 2016, at 04:53, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sat, 6 Feb 2016 16:21:11 -0800 Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
To be honest, unlike everyone else on this thread, I'm actually more interested in the simplicity gains than the performance gains.
It is a laudable goal, but what is proposed here wouldn't simplify much if anything, since some opcodes need more than 8 bits of arguments (typically the opcodes that have two logical 8-bit arguments packed in the 16-bit word). So you'll still get variable-sized opcodes and need adequate decoding machinery for them.
With "wordcode" (at least as I've implemented it), EXTENDED_ARG code doesn't get any simpler (or more complicated)--but the main loop can just read 2 bytes (and then add in any extended arg value) instead of reading 1 or 3, which is definitely simpler. (As implemented in my patch, it's not actually much simpler, because I use all the same macros, I just redefine them, but that's because my initial goal was to change as little as possible--which turns out to be very little. If it seems promising, we can do a slightly larger change that will simplify things more, and possible also improve performance, but I don't want to do that until I've proven it's worth trying.)
So perhaps bytecode should be processed by the compile chain in the form of:
typedef struct { int opcode; int operand; } instruction_t;
instruction_t *bytecode;
And then get packed at the end, when creating the code object.
Yes, there's another thread on that, with a few different variations. This is actually independent of the wordcode change, so I'm doing it on a separate branch.

The variation I'm currently working on (based on suggestions by Serhiy) just uses a 32-bit "unpacked bytecode" format (1 byte opcode, 3 bytes arg, no EXTENDED_ARG). The compiler flattens its blocks of instruction objects into unpacked bytecode, gives that to the optimizer (currently that's the peephole optimizer, but this is also where Victor's PEP 511 bytecode transformer hooks fit in), and then packs the bytecode, removing NOPs and fixing up jump targets and lnotab. This makes the peephole optimizer a lot simpler. I've also exposed unpack and pack-and-fixup functions via the C API and the dis module, which means bytecode-hacking decorators, import hooks, etc. can work on unpacked bytecode, and usually won't need a third-party module like byteplay to be readable.

The big question is whether it's acceptable to limit args to 24 bits. I don't think jumps > 4 million are an issue, but > 255 annotations might be? I'll test it with as much code as possible to see; if not, unpacking to 64 bits (whether with a struct or just an int64) is the obvious answer.

Also, it may turn out that bytecode processing is still too painful even with the "unpacked" format. In which case we'll need something different--labels, a tree of blocks similar to what the compiler already has, or something else. But I think this will be sufficient.
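As a rough sketch of the unpacked format described above (illustrative only, not the actual branch): each classic instruction becomes one 32-bit value with the opcode in the low byte, up to 24 bits of argument above it, and EXTENDED_ARG folded away.

    #include <stdint.h>
    #include <stddef.h>

    #define HAVE_ARGUMENT 90    /* real CPython values */
    #define EXTENDED_ARG 144

    /* Unpack classic variable-length bytecode (1 or 3 bytes per
       instruction) into uint32 instructions.  Arguments wider than
       24 bits would be truncated here, which is exactly the limitation
       discussed above; jump targets are still byte offsets into the
       packed input and would need renumbering in a real implementation. */
    static size_t unpack(const unsigned char *code, size_t len, uint32_t *out)
    {
        size_t n = 0;
        uint32_t ext = 0;
        for (size_t i = 0; i < len; ) {
            unsigned char op = code[i++];
            uint32_t arg = 0;
            if (op >= HAVE_ARGUMENT) {
                arg = ext | code[i] | ((uint32_t)code[i + 1] << 8);
                i += 2;
            }
            if (op == EXTENDED_ARG) {
                ext = arg << 16;        /* folded into the next instruction */
                continue;
            }
            out[n++] = (uint32_t)op | (arg << 8);
            ext = 0;
        }
        return n;                       /* number of unpacked instructions */
    }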
(Plus, adding up all these little gains could soon get us to the point where 3.7 finally beats 2.7 in every benchmark, instead of just most of them, which would kill off an annoying source of FUD.)
I think that FUD is very tired now and very few people pay attention to it.
Well, it spawned a thread on -ideas with dozens of replies just a week or two ago...
(also, the breadth of new features and improvements in the 3.x line massively offsets small degradations on micro-benchmarks)
You don't have to convince me; as far as I'm concerned, the barrier was crossed with 3.2.
On 06.02.16 21:18, Antoine Pitrou wrote:
It sounds like, by 16-bit opcodes, you mean combine the opcode and the argument in a single 16-bit word. But that doesn't solve the issue you want to solve: you still have to decode the argument encoded in the 16-bit word. I don't see where the benefit is.
Current code uses 3 read operations:

    opcode = *next_instr++;
    next_instr += 2;
    oparg = (next_instr[-1]<<8) + next_instr[-2];

Even combining the latter two operations in one read operation gives a 10% gain in the microbenchmark (see http://bugs.python.org/issue25823):

    opcode = *next_instr++;
    oparg = *(unsigned short *)next_instr;
    next_instr += 2;

With combining the opcode and the argument in an always aligned 16-bit word I expect a larger gain:

    word = *(unsigned short *)next_instr;
    next_instr += 2;
    opcode = word & 0xff;
    oparg = word >> 8;
It is generally estimated the overhead of bytecode dispatch and decoding is around 10-30% for CPython. You cannot hope to eliminate that overhead entirely without writing a (JIT or AOT) compiler, so any heroic effort to restructure the current opcode space and structure will at best win 5 to 20% on select benchmarks.
This would be an awesome result.
On Sun, 7 Feb 2016 09:18:18 +0200 Serhiy Storchaka <storchaka@gmail.com> wrote:
On 06.02.16 21:18, Antoine Pitrou wrote:
It sounds like, by 16-bit opcodes, you mean combine the opcode and the argument in a single 16-bit word. But that doesn't solve the issue you want to solve: you still have to decode the argument encoded in the 16-bit word. I don't see where the benefit is.
Current code uses 3 read operations:

    opcode = *next_instr++;
    next_instr += 2;
    oparg = (next_instr[-1]<<8) + next_instr[-2];

Even combining the latter two operations in one read operation gives a 10% gain in the microbenchmark (see http://bugs.python.org/issue25823):

    opcode = *next_instr++;
    oparg = *(unsigned short *)next_instr;
    next_instr += 2;
So it remains to be seen how much performance is won on actual non-silly benchmarks ;-) Regards Antoine.
On 2016-02-07 07:18, Serhiy Storchaka wrote:
On 06.02.16 21:18, Antoine Pitrou wrote:
It sounds like, by 16-bit opcodes, you mean combine the opcode and the argument in a single 16-bit word. But that doesn't solve the issue you want to solve: you still have to decode the argument encoded in the 16-bit word. I don't see where the benefit is.
Current code uses 3 read operations:

    opcode = *next_instr++;
    next_instr += 2;
    oparg = (next_instr[-1]<<8) + next_instr[-2];

Even combining the latter two operations in one read operation gives a 10% gain in the microbenchmark (see http://bugs.python.org/issue25823):

    opcode = *next_instr++;
    oparg = *(unsigned short *)next_instr;
    next_instr += 2;

The previous code is big-endian, whereas this code's endianness is processor-dependent.

With combining the opcode and the argument in an always aligned 16-bit word I expect a larger gain:

    word = *(unsigned short *)next_instr;
    next_instr += 2;
    opcode = word & 0xff;
    oparg = word >> 8;
It is generally estimated the overhead of bytecode dispatch and decoding is around 10-30% for CPython. You cannot hope to eliminate that overhead entirely without writing a (JIT or AOT) compiler, so any heroic effort to restructure the current opcode space and structure will at best win 5 to 20% on select benchmarks.
This would be an awesome result.
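Regarding the endianness point raised above: an explicit two-byte read keeps the original byte order on any host, and compilers such as GCC and Clang typically recognize the idiom and emit a single 16-bit load anyway. An illustrative sketch, not code from the thread:

    #include <stdint.h>

    /* Combine the two argument bytes in a host-independent way; on
       little-endian targets this usually compiles to one 16-bit load. */
    static inline unsigned int fetch_oparg(const uint8_t *next_instr)
    {
        return (unsigned int)next_instr[0]
             | ((unsigned int)next_instr[1] << 8);
    }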