optimized VM ideas

So, I've been kicking around some ideas for an optimized python VM. I freely admit I'm an amateur at this, but I find the problem of making python code run faster fascinating. My ideas arose from observing that google V8's JIT compiler and type system are much simpler compared to TraceMonkey, but is also faster, and also learning that SquirrelFish is allegedy faster than V8, even though it doesn't compile to native code at all (which V8 and I believe TraceMonkey both do). This leads me to believe that relatively simple, more general concepts in VM design can have a bigger impact then specific, highly complicated JIT solutions, in the context of dynamic languages that can't be easily typed at compile time. So I've thought of a few ideas for a more (new) streamlined python VM: * Simplify the cpython object model as much as possible, while still allowing most of the power of the current model. * Either keep referencing counting, or experiment with some of the newer techniques such as pointer escaping. Object models that exclusively rely on cyclic GC's have many issues and are hard to get right. * Possibly modify the bytecode to be register-based, as in SquirrelFish. Not sure if this is worth it with python code. * Use direct threading (which is basically optimizing switch statements to be only one or two instructions) for the bytecode loop. * Remove string lookups for member access entirely, and replaced with a system of unique identifyers. The idea is you would use a hash in the types to map a member id to an index. Hashing ints is faster then strings, and I've even thought about experimenting with using collapsed arrays instead of hashes. Of course, the design would still need to support string lookups when necessary. I've thought about this a lot, and I think you'd need the same general idea as V8's hidden classes for this to work right (though instead of classes, it'd just be member/unique id lookup maps). I'm not sure I'll have the time to anytime soon to prototype these ideas, but I thought I'd kick them out there and see what people say. Note, I'm in no way suggesting any sort of change to the existing cpython VM (it's way, way too early for that kind of talk). references: v8's design: http://code.google.com/apis/v8/design.html squirrelfish's design: http://blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/ Joe

On Thu, Jan 22, 2009 at 10:59 PM, joe <joeedh@gmail.com> wrote: ...
* Use direct threading (which is basically optimizing switch statements to be only one or two instructions) for the bytecode loop.
fyi - http://bugs.python.org/issue4753 does this (at least when using gcc). The optimization is not about removing instructions per se. but in getting rid of the single switch statement's unpredictable branch that causes modern cpus to stall while determining the correct place to jump rather than speculatively guessing correctly a significant portion of the time. ...
Python strings are already immutable, their hash is computed only once. strings used in code as attributes are interned so that all occurances of that string are the same object making the most common lookups for attribute accesses a table lookup with a pointer equality check.
Anyways, the ideas are great and are definately all things people are considering. This should be an great year for python performance and all language VM performance in general. happy hacking, Greg

On Jan 23, 2009, at 4:59 AM, joe wrote:
I don't think TraceMonkey is slower than V8 I believe the last time I looked TraceMonkey was faster than V8 and it is becoming even faster at each interaction. The way TraceMonkey works reminds me a bit of Psyco, although I might be mixing it with the PyPy JIT. But talking about Psyco, why people don't go help Psyco if all they want is a JIT? It is not like the idea of having a JIT on Python is even new... Psyco was optimizing code even before Webkit/V8 existed.
I believe in the exact oposite, and TraceMonkey is probably one of the proofs of that...
This would modify the language, so it might be interesting, but would generate something which is not Python.
Don't know, but a good GC is way faster than what CPython is doing already, but maybe it is a good idea to explore some others perspectives on this.
Maybe it would help a bit. I don't think it would help more than 10% tops (but I am completely guessing here)
The problem with this is (besides the error someone has already stated about your phrasing) that python has really complex bytecodes, so this would also only gain around 10% and it only works with compilers that accept goto labels which the MSVC for example does not (maybe there are more compilers that also doesn't).
A form of hidden classes is already part of PyPy (but I think that only the jit does this). But you can simply remove string lookups as people can implement special methods to track this on the current Python. As I said before I don't believe changing the semantics of python for the sake of performance is even possible.
If you are not talking about changing CPython VM why not look at Psyco and PyPy? :)
-- Leonardo Santagada santagada at gmail.com

Leonardo Santagada <santagada@...> writes:
The way TraceMonkey works reminds me a bit of Psyco, although I might be mixing it with the PyPy JIT.
I'm not sure it works like Psyco. Read the paper on trace trees, it is quite interesting, and it may be doable to retrofit at least the trace construction part in the current bytecode execution loop. http://www.ics.uci.edu/~franz/Site/pubs-pdf/ICS-TR-06-16.pdf
Three problems with Psyco: 1) it is not 100% compatible with Python semantics 2) it only works on i386 (not even x86-64) 3) it's (roughly) unmaintained
Without changing how the GC works, it's probably possible to improve things a bit. See e.g. http://bugs.python.org/issue4688
I had thought about this and it seems a problem would be reference counting. You have to DECREF a register as soon as it isn't used anymore, which adds some bookkeeping.
Well the bytecodes are not that complex. A bunch of them are completely implemented inline the evaluation loop. Also, since it is a stack machine, there are many trivial bytecodes such as LOAD_FAST, POP_TOP etc. The threaded code patch in http://bugs.python.org/issue4753 has given between 0% and 20% speedups on pybench totals amongst the various posters (and a 5% slowdown in one case). cheers Antoine.

On Thu, Jan 22, 2009 at 10:59 PM, joe <joeedh@gmail.com> wrote: ...
* Use direct threading (which is basically optimizing switch statements to be only one or two instructions) for the bytecode loop.
fyi - http://bugs.python.org/issue4753 does this (at least when using gcc). The optimization is not about removing instructions per se. but in getting rid of the single switch statement's unpredictable branch that causes modern cpus to stall while determining the correct place to jump rather than speculatively guessing correctly a significant portion of the time. ...
Python strings are already immutable, their hash is computed only once. strings used in code as attributes are interned so that all occurances of that string are the same object making the most common lookups for attribute accesses a table lookup with a pointer equality check.
Anyways, the ideas are great and are definately all things people are considering. This should be an great year for python performance and all language VM performance in general. happy hacking, Greg

On Jan 23, 2009, at 4:59 AM, joe wrote:
I don't think TraceMonkey is slower than V8 I believe the last time I looked TraceMonkey was faster than V8 and it is becoming even faster at each interaction. The way TraceMonkey works reminds me a bit of Psyco, although I might be mixing it with the PyPy JIT. But talking about Psyco, why people don't go help Psyco if all they want is a JIT? It is not like the idea of having a JIT on Python is even new... Psyco was optimizing code even before Webkit/V8 existed.
I believe in the exact oposite, and TraceMonkey is probably one of the proofs of that...
This would modify the language, so it might be interesting, but would generate something which is not Python.
Don't know, but a good GC is way faster than what CPython is doing already, but maybe it is a good idea to explore some others perspectives on this.
Maybe it would help a bit. I don't think it would help more than 10% tops (but I am completely guessing here)
The problem with this is (besides the error someone has already stated about your phrasing) that python has really complex bytecodes, so this would also only gain around 10% and it only works with compilers that accept goto labels which the MSVC for example does not (maybe there are more compilers that also doesn't).
A form of hidden classes is already part of PyPy (but I think that only the jit does this). But you can simply remove string lookups as people can implement special methods to track this on the current Python. As I said before I don't believe changing the semantics of python for the sake of performance is even possible.
If you are not talking about changing CPython VM why not look at Psyco and PyPy? :)
-- Leonardo Santagada santagada at gmail.com

Leonardo Santagada <santagada@...> writes:
The way TraceMonkey works reminds me a bit of Psyco, although I might be mixing it with the PyPy JIT.
I'm not sure it works like Psyco. Read the paper on trace trees, it is quite interesting, and it may be doable to retrofit at least the trace construction part in the current bytecode execution loop. http://www.ics.uci.edu/~franz/Site/pubs-pdf/ICS-TR-06-16.pdf
Three problems with Psyco: 1) it is not 100% compatible with Python semantics 2) it only works on i386 (not even x86-64) 3) it's (roughly) unmaintained
Without changing how the GC works, it's probably possible to improve things a bit. See e.g. http://bugs.python.org/issue4688
I had thought about this and it seems a problem would be reference counting. You have to DECREF a register as soon as it isn't used anymore, which adds some bookkeeping.
Well the bytecodes are not that complex. A bunch of them are completely implemented inline the evaluation loop. Also, since it is a stack machine, there are many trivial bytecodes such as LOAD_FAST, POP_TOP etc. The threaded code patch in http://bugs.python.org/issue4753 has given between 0% and 20% speedups on pybench totals amongst the various posters (and a 5% slowdown in one case). cheers Antoine.
participants (4)
-
Antoine Pitrou
-
Gregory P. Smith
-
joe
-
Leonardo Santagada