[Python-Dev] Register-based VM for CPython

Sat Nov 17 02:13:25 CET 2012

Hi,

I'm still looking for the best approach to speedup CPython. I found the
following article comparing a stack-based VM to a register-based VM which
announces a speed up between 20% and 40% (depending if the instruction
dispatch uses a dummy switch or computed goto):

http://static.usenix.org/events/vee05/full_papers/p153-yunhe.pdf
Yunhe Shi, David Gregg, Andrew Beatty, M. Anton Ertl, 2005

I tried to implement such register-based instructions for CPython in the
following fork of Python 3.4:
http://hg.python.org/sandbox/registervm/

Contact me if you are interested and/or if you would like to contribute!

--

My implementation is not finished and still experimental. It is only
enabled explictly by calling registervm.optimize_func(func) where
registervm is a new module. This function rewrites the bytecode of the
function inplace. I chose this approach because it was simple to implement,
it is simple to compare stack-based and register-based instructions, and it
is also the approach used in the paper :-)

The major drawback of the register approach (at least of my implementation)
is that it changes the lifetime of objects. Newly created objects are only
"destroyed" at the exit of the function, whereas the stack-based VM
destroys "immediatly" objects (thanks to the reference counter). PyPy has
similar issues with its different garbage collector.

On Linux (with computed goto in ceval.c), there are some interesting
speedups in pybench, up to 25% faster. Such speedup can be explained with
the bytecode. For example, "c = a + b" is compiled to:

   LOAD_FAST            'a'
   LOAD_FAST            'b'
   BINARY_ADD
   STORE_FAST           'c'

The optimized bytecode becomes:

  BINARY_ADD_REG       'c', 'a', 'b'

So 4 operations are replaced with only 1 operation. So instruction
dispatching has a cost, measurable in this microbenchmark.

My implementation is incomplete: if you see "PUSH_REG" or "POP_REG" in the
assembler, your code will be slower than the original bytecode. There is no
register allocator and the optimizer doesn't understand the control flow,
so I expect more performance improvment with a more complete
implementation. Some optimizations are also disabled by default because the
implementation is buggy.

The main changes of registervm are done in Python/ceval.c (implement new
instructions) and Lib/registervm.py (the new module rewriting the
bytecode). Objects/frameobject.c is also modified to allocate registers.

See registervm documentation for more information:
http://hg.python.org/sandbox/registervm/file/tip/REGISTERVM.txt

--

The WPython project is similar to my work (except that it does not use
registers). It tries also to reduce the overhead of instruction dispatch by
using more complex instructions.
http://code.google.com/p/wpython/

Using registers instead of a stack allow to implement more optimizations
(than WPython). For example, it's possible to load constants outside loops
and merge "duplicate constant loads".

I also implemented more aggressive and experimental optimizations (disabled
by default) which may break applications: move loads of attributes and
globals outside of loops, and replace binary operations with inplace
operations. For example, "x=[]; for ...: x.append(...)" is optimized to
something like "x=[]; x_append=x.append; for ...: x_append(...)", and "x =
x + 1" is replaced with "x += 1".

Victor
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20121117/3b6d4290/attachment.html>