
On 2016-02-07 07:18, Serhiy Storchaka wrote:
On 06.02.16 21:18, Antoine Pitrou wrote:
It sounds like, by 16-bit opcodes, you mean combine the opcode and the argument in a single 16-bit word. But that doesn't solve the issue you want to solve: you still have to decode the argument encoded in the 16-bit word. I don't see where the benefit is.
Current code uses 3 read operations:
    opcode = *next_instr++;
    next_instr += 2;
    oparg = (next_instr[-1]<<8) + next_instr[-2];
Even combining the latter two operations into one read operation gives us a 10% gain in the microbenchmark (see http://bugs.python.org/issue25823):
    opcode = *next_instr++;
    oparg = *(unsigned short *)next_instr;
    next_instr += 2;
The previous code reads the argument as little-endian (low byte first, since next_instr[-2] is the low byte), whereas this code's endianness is processor-dependent.
By combining the opcode and the argument in an always-aligned 16-bit word, I expect a larger gain.
    word = *(unsigned short *)next_instr;
    next_instr += 2;
    opcode = word & 0xff;
    oparg = word >> 8;
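For comparison, here is a minimal, self-contained sketch of the three decoding schemes discussed above, written as standalone functions. The names decoded_t, decode_three_reads, decode_two_reads, decode_one_word and host_is_little_endian are hypothetical helpers invented for this illustration; they are not CPython's actual definitions. The sketch also assumes that the combined word is stored as a native 16-bit value, and it uses memcpy for the 16-bit read in the second scheme so that the example stays well-defined on unaligned addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type for this sketch; not a CPython structure. */
typedef struct {
    unsigned opcode;
    unsigned oparg;
} decoded_t;

/* Scheme 1: three single-byte reads. The argument is assembled with the
 * low byte first, so the result is the same on every host. */
static decoded_t decode_three_reads(const unsigned char **pc)
{
    assert(pc != NULL && *pc != NULL);
    const unsigned char *next_instr = *pc;
    decoded_t d;
    d.opcode = *next_instr++;
    next_instr += 2;
    d.oparg = ((unsigned)next_instr[-1] << 8) + next_instr[-2];
    *pc = next_instr;
    return d;
}

/* Scheme 2: one byte read plus one 16-bit read. The 16-bit value is
 * interpreted in the host's byte order, so the decoded argument differs
 * between little-endian and big-endian machines. */
static decoded_t decode_two_reads(const unsigned char **pc)
{
    assert(pc != NULL && *pc != NULL);
    const unsigned char *next_instr = *pc;
    decoded_t d;
    uint16_t arg;
    d.opcode = *next_instr++;
    memcpy(&arg, next_instr, sizeof arg);  /* stand-in for the pointer cast */
    next_instr += 2;
    d.oparg = arg;
    *pc = next_instr;
    return d;
}

/* Scheme 3: a single aligned 16-bit read. The opcode sits in the low
 * 8 bits and the argument in the high 8 bits of the native word. */
static decoded_t decode_one_word(const uint16_t **pc)
{
    assert(pc != NULL && *pc != NULL);
    uint16_t word = *(*pc)++;
    decoded_t d;
    d.opcode = word & 0xff;
    d.oparg = word >> 8;
    return d;
}

/* Returns 1 when the first byte of a 16-bit value is its low byte. */
static int host_is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

Given the bytes 0x64, 0x34, 0x12, the first function yields opcode 0x64 and argument 0x1234 on any host, while the second yields 0x1234 only where host_is_little_endian() returns 1. Note that in the third scheme the argument is limited to 8 bits per word, so larger arguments would need some extension mechanism that this sketch does not cover.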
It is generally estimated that the overhead of bytecode dispatch and decoding is around 10-30% for CPython. You cannot hope to eliminate that overhead entirely without writing a (JIT or AOT) compiler, so any heroic effort to restructure the current opcode space and structure will at best win 5 to 20% on select benchmarks.
This would be an awesome result.