In the middle of recent discussions about Python performance, it was discussed to change the Python bytecode. Serhiy proposed to reuse MicroPython short bytecode to reduce the disk space and reduce the memory footprint.
Demur Rumed proposes a different change to use a regular bytecode using 16-bit units: an instruction has always one 8-bit argument, it's zero if the instruction doesn't have an argument:
According to benchmarks, it looks faster:
IMHO it's a nice enhancement: it makes the code simpler. The most interesting change is made in Python/ceval.c:
- if (HAS_ARG(opcode)) - oparg = NEXTARG(); + oparg = NEXTARG();
This code is the very hot loop evaluating Python bytecode. I expect that removing a conditional branch here can reduce the CPU branch misprediction.
I reviewed first versions of the change, and IMHO it's almost ready to be merged. But I would prefer to have a review from a least a second core reviewer.
Can someone please review the change?
The side effect of wordcode is that arguments in 0..255 now uses 2 bytes per instruction instead of 3, so it also reduce the size of bytecode for the most common case.
Larger argument, 16-bit argument (0..65,535), now uses 4 bytes instead of 3. Arguments are supported up to 32-bit: 24-bit uses 3 units (6 bytes), 32-bit uses 4 units (8 bytes). MAKE_FUNCTION uses 16-bit argument for keyword defaults and 24-bit argument for annotations. Other common instruction known to use large argument are jumps for bytecode longer than 256 bytes.
Right now, ceval.c still fetchs opcode and then oparg with two 8-bit instructions. Later, we can discuss if it would be possible to ensure that the bytecode is always aligned to 16-bit in memory to fetch the two bytes using a uint16_t* pointer.
Maybe we can overallocate 1 byte in codeobject.c and align manually the memory block if needed. Or ceval.c should maybe copy the code if it's not aligned?
Raymond Hettinger proposes something like that, but it looks like there are concerns about non-aligned memory accesses:
The cost of non-aligned memory accesses depends on the CPU architecture, but it can raise a SIGBUS on some arch (MIPS and SPARC?).