Paolo Giarrusso wrote:
Hi all,
Hi!
After completing our student project, I have enough experience to say a bit more. We wrote a C interpreter for a Python subset and made it much faster than the Python interpreter (50%-60% faster). That was due to the use of indirect threading, tagged pointers and unboxed integers, and a real GC (a copying one, which is enough for our current benchmarks; a generational GC would be more realistic, but we didn't get to it).
That's interesting, but it says little about the benefit of threaded interpretation itself, since the speedup could come from the other optimizations. For example, I suspect that for the benchmark you showed, most of the speedup comes from tagged pointers and the better GC. Is it possible to make your eval loop non-threaded? Measuring the difference with and without indirect threading would give a good hint of how much you gain from it.

What kind of bytecode do you use? The same as CPython, or a custom one? E.g. I found that handling the EXTENDED_ARG CPython opcode properly requires inserting a lot of code before jumping to the next opcode.

Moreover, tagging pointers with 1 helps a lot for numerical benchmarks, but it may cause a slowdown for other kinds of operations. Do you have non-numerical benchmarks? (Though I know it's hard to get a fair comparison, because the Python object model is complex and it's not easy to implement a subset of it in a way that is not cheating.)

Finally, as Carl said, it would be nice to know what kind of subset it is. E.g., does it support exceptions, sys.settrace() and sys._getframe()?
In fact, we are at least 50% faster on anything we can run, including this benchmark, thanks to unboxed integers and tagged pointers (we tag pointers with 1 and integers with 0, as V8 does and SELF did, so you can add integers without untagging them):
def f():
    y = 0
    x = 10000000
    while x:
        x -= 1
        y += 3
    return y - 29999873
And since we do overflow checking (by checking EFLAGS in assembly, though the code could probably be improved further), I don't think a comparison on this benchmark is unfair in any way.
Is your subset large enough to run e.g. pystone? What is the result?
1) in Python a lot of opcodes are quite complex and time-consuming,
That's wrong for a number of reasons: the most common opcodes are probably loads from the constant pool, and loads and stores to/from the locals (through LOAD_FAST/STORE_FAST). Right now, our hotspot is indeed the dispatch after the LOAD_FAST opcode.
If you do benchmarks like the one shown above, I agree with you. If you consider real-world applications, unfortunately there is more than LOAD_CONST and LOAD_FAST: GETATTR, SETATTR, CALL, etc. are all much more time-consuming than LOAD_{FAST,CONST}.
That's your problem - threading helps when you spend most of the time on dispatch, and efficient interpreters get to that point.
The question is: is it possible for a full Python interpreter to be "efficient" as you define it?
2) due to Python's semantics, it's not possible to just jump from one opcode to the next, as we need to do a lot of bookkeeping, like remembering what was the last line executed, etc.
No, you don't need that, and not even CPython does it. For exception handling, just _when an exception is thrown_, [cut]
Sorry, I made a typo: it is necessary to remember the last *bytecode* executed, not the last line. This is needed to implement sys.settrace() properly. I never mentioned exception handling; that was your (wrong :-)) guess.
If the interpreter loop is able to overflow the I-cache, that should be fought through __builtin_expect first, to give hints for jump prediction and lay slow paths out of line.
I think that Armin tried once to use __builtin_expect, but I don't remember the outcome. Armin, what was it?
Well, my plan is first to try, at some point, to retrofit threading into the Python interpreter and benchmark the difference - it shouldn't take long, but it has a low priority at the moment.
That would be cool; tell us when you have done it :-).