Byte code arguments from two to one byte: did anyone try this?

I tried to find any research on this subject, but I couldn't find any, so I'll be daring and vulnerable and just try it out to see what your thoughts are. I single stepped a simple loop in Python to see where the efficiency bottlenecks are. I was impressed by the optimizations already in there, but I still dare to suggest an optimization that from my estimates might shave off a few cycles, speeding up Python about 5%. The idea is simple: change the byte code argument values from two bytes to one. Implications are: - code changes are relatively simple, see below - fewer memory reads, which are becoming more and more expensive - saves three instructions for every opcode with args (i.e. most of them) Code changes are, as far as I could find: compile.c: assemble_emit must produce extended opcodes for all cases of more than 8 bits instead of 16 ceval.c: NEXTARG and PEEKARG need adjustment EXTENDED_ARG needs adjustment (this will be a four byte instruction, which is ugly, I agree) peephole.c: GETARG, SETARG, need adjustment also GETJUMPTGT, CODESIZE routine tuple_of_constants, fold_binops_on_constants, PyCode_Optimize are dependent on instruction length, which will be 2 instead of 3 (search for the digit 3 will find all cases, as far as I checked) you probably will have to write a macro for codestr[i+3] there is a check for code length >32700, but I think this one might stay, maybe if a few extra checks are added. dis: minor adjustments Estimation of speed impact: about 80% of the instructions seem to have an argument, and I never saw an opcode >255 while looking at bytecode, so they are probably not frequent. The NEXTARG macro expands on my Macbook to: mov -408(%ebp),%edx (next_instr) movzbl 2(%edx),%eax (*second byte) shl $0x8,%eax (*shift) movzbl 1(%edx),%edx (first byte) add %edx,%eax (*combine) and the starred instructions will vanish. The main loop is approximately 40 instructions, so a saving of three instructions is significant. I don't dare to claim 3/40 = 7.5% savings, but I think 5% may be realistic. Did anyone try this already? If not, I might take up the gauntlet and try it myself, but I never did this before... - Jurjen PS I also saw that some scratch variables, mainly v and x, are carefull stored back in memory by the compiler and the end of the big interpreter loop, while their value isn't used anymore, of course. A few carefully placed braces might tell the compiler how useless this is and save another few percent.

Jurjen N.E. Bos, 31.01.2011 10:17:
I single stepped a simple loop in Python to see where the efficiency bottlenecks are.
What version of CPython did you try that with? The latest py3k branch? Stefan

Jurjen N.E. Bos wrote:
I was impressed by the optimizations already in there, but I still dare to suggest an optimization that from my estimates might shave off a few cycles, speeding up Python about 5%. The idea is simple: change the byte code argument values from two bytes to one.
Interesting. Have you seem Cesare Di Mauro's WPython project, which takes the opposite strategy? http://code.google.com/p/wpython2/ -- Steven

On 1/31/2011 5:31 AM, Steven D'Aprano wrote:
Jurjen N.E. Bos wrote:
I was impressed by the optimizations already in there, but I still dare to suggest an optimization that from my estimates might shave off a few cycles, speeding up Python about 5%. The idea is simple: change the byte code argument values from two bytes to one.
Interesting. Have you seem Cesare Di Mauro's WPython project, which takes the opposite strategy?
The two strategies could be mixed. Some 'word codes' could consist of a bytecode + byte arg, and others a real word code. Maybe WPython does that already. Might end up being slower though. -- Terry Jan Reedy

2011/1/31 Terry Reedy <tjreedy@udel.edu>
On 1/31/2011 5:31 AM, Steven D'Aprano wrote:
Jurjen N.E. Bos wrote:
I was impressed by the optimizations already in there, but I still dare to suggest an optimization that from my estimates might shave off a few cycles, speeding up Python about 5%. The idea is simple: change the byte code argument values from two bytes to one.
Interesting. Have you seem Cesare Di Mauro's WPython project, which takes the opposite strategy?
The two strategies could be mixed. Some 'word codes' could consist of a bytecode + byte arg, and others a real word code. Maybe WPython does that already. Might end up being slower though.
-- Terry Jan Reedy
Yes, WPython already does it ( http://wpython2.googlecode.com/files/Beyond%20Bytecode%20-%20A%20Wordcode-ba...) , but on average it was faster (pag. 28). Cesare
participants (5)
-
Cesare Di Mauro
-
Jurjen N.E. Bos
-
Stefan Behnel
-
Steven D'Aprano
-
Terry Reedy