Hi,<div><br></div><div>I'm still looking for the best approach to speedup CPython. I found the following article comparing a stack-based VM to a register-based VM which announces a speed up between 20% and 40% (depending if the instruction dispatch uses a dummy switch or computed goto):</div>


<div><br></div><div><a href="http://static.usenix.org/events/vee05/full_papers/p153-yunhe.pdf">http://static.usenix.org/events/vee05/full_papers/p153-yunhe.pdf</a><br></div><div><span style="color:rgb(0,0,0);font-family:sans-serif;font-size:13px;line-height:19px;text-align:justify">Yunhe Shi, David Gregg, Andrew Beatty, M. Anton Ertl, 2005</span><br>


</div><div><span style="color:rgb(0,0,0);font-family:sans-serif;font-size:13px;line-height:19px;text-align:justify"><br></span></div><div><span style="color:rgb(0,0,0);font-family:sans-serif;font-size:13px;line-height:19px;text-align:justify">I tried to implement such register-based instructions for CPython in the following fork of Python 3.4:</span></div>


<div><a href="http://hg.python.org/sandbox/registervm/">http://hg.python.org/sandbox/registervm/</a><span style="color:rgb(0,0,0);font-family:sans-serif;font-size:13px;line-height:19px;text-align:justify"><br></span></div>


<div><br></div><div>Contact me if you are interested and/or if you would like to contribute!<br></div><div><br></div><div>--</div><div><br></div><div>My implementation is not finished and still experimental. It is only enabled explictly by calling registervm.optimize_func(func) where registervm is a new module. This function rewrites the bytecode of the function inplace. I chose this approach because it was simple to implement, it is simple to compare stack-based and register-based instructions, and it is also the approach used in the paper :-)</div>


<div><br></div><div>The major drawback of the register approach (at least of my implementation) is that it changes the lifetime of objects. Newly created objects are only "destroyed" at the exit of the function, whereas the stack-based VM destroys "immediatly" objects (thanks to the reference counter). PyPy has similar issues with its different garbage collector.</div>


<div><br></div><div>On Linux (with computed goto in ceval.c), there are some interesting speedups in pybench, up to 25% faster. Such speedup can be explained with the bytecode. For example, "c = a + b" is compiled to:</div>


<div><br></div><div><div><div>   LOAD_FAST            'a' </div><div>   LOAD_FAST            'b' </div><div>   BINARY_ADD            </div><div>   STORE_FAST           'c' </div></div></div><div><br>


</div><div>The optimized bytecode becomes:</div><div><br></div><div><div>  BINARY_ADD_REG       'c', 'a', 'b' </div></div><div><br></div><div>So 4 operations are replaced with only 1 operation. So instruction dispatching has a cost, measurable in this microbenchmark.</div>


<div><br></div><div>My implementation is incomplete: if you see "PUSH_REG" or "POP_REG" in the assembler, your code will be slower than the original bytecode. There is no register allocator and the optimizer doesn't understand the control flow, so I expect more performance improvment with a more complete implementation. Some optimizations are also disabled by default because the implementation is buggy.</div>


<div><br></div><div>The main changes of registervm are done in Python/ceval.c (implement new instructions) and Lib/registervm.py (the new module rewriting the bytecode). Objects/frameobject.c is also modified to allocate registers.<br>


</div><div><br></div><div>See registervm documentation for more information:</div><div><a href="http://hg.python.org/sandbox/registervm/file/tip/REGISTERVM.txt">http://hg.python.org/sandbox/registervm/file/tip/REGISTERVM.txt</a><br>


</div><div><br></div><div>--</div><div><br></div><div>The WPython project is similar to my work (except that it does not use registers). It tries also to reduce the overhead of instruction dispatch by using more complex instructions.</div>


<div><a href="http://code.google.com/p/wpython/">http://code.google.com/p/wpython/</a></div><div><br></div><div>Using registers instead of a stack allow to implement more optimizations (than WPython). For example, it's possible to load constants outside loops and merge "duplicate constant loads".</div>


<div><br></div><div>I also implemented more aggressive and experimental optimizations (disabled by default) which may break applications: move loads of attributes and globals outside of loops, and replace binary operations with inplace operations. For example, "x=[]; for ...: x.append(...)" is optimized to something like "x=[]; x_append=x.append; for ...: x_append(...)", and "x = x + 1" is replaced with "x += 1".</div>


<div><br></div><div>Victor</div>