[pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)

Antonio Cuni anto.cuni at gmail.com
Sun Dec 21 23:49:18 CET 2008


Paolo Giarrusso wrote:
> Hi all,

Hi!

> after the completion of our student project, I have enough experience
> to say something more.
> We wrote in C an interpreter for a Python subset and we could make it
> much faster than the Python interpreter (50%-60% faster). That was due
> to the usage of indirect threading, tagged pointers and unboxed
> integers, and a real GC (a copying one, which is enough for the
> current benchmarks we have - a generational GC would be more realistic
> but we didn't manage to do it).

That's interesting, but it says little about the benefit of threaded 
interpretation itself, as the speedup could come from the other 
optimizations.  For example, I suspect that for the benchmark you showed, 
most of the speedup comes from tagged pointers and the better GC.

Is it possible to make your eval loop non-threaded? Measuring the difference 
with and without indirect threading would give a good hint of how much you 
gain from it.
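
For concreteness, this is the kind of difference I have in mind.  A minimal, 
self-contained sketch (the opcode set is made up; the non-threaded baseline 
would replace the computed gotos with a single for(;;)/switch loop):

    #include <stdio.h>

    /* Hypothetical 4-opcode bytecode, just to show the dispatch style. */
    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run_threaded(const int *pc)
    {
        /* Indirect threading via GCC's computed gotos: each opcode ends
           with its own indirect jump instead of going back to one shared
           switch, which indirect-branch predictors handle much better. */
        static void *dispatch[] = { &&op_push, &&op_add, &&op_print,
                                    &&op_halt };
        int stack[16], *sp = stack;

        goto *dispatch[*pc++];
    op_push:
        *sp++ = *pc++;
        goto *dispatch[*pc++];
    op_add:
        --sp; sp[-1] += sp[0];
        goto *dispatch[*pc++];
    op_print:
        printf("%d\n", sp[-1]);
        goto *dispatch[*pc++];
    op_halt:
        return;
    }

    int main(void)
    {
        const int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
                             OP_PRINT, OP_HALT };
        run_threaded(prog);  /* prints 5 */
        return 0;
    }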

What kind of bytecode do you use? The same as CPython or a custom one? E.g. I 
found that to handle CPython's EXTENDED_ARG opcode properly, it is necessary 
to insert a lot of code before jumping to the next opcode.
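
To illustrate: with CPython 2.x's bytecode format (1-byte opcode, optional 
2-byte little-endian argument), the fetch logic grows to something like this 
sketch (the function is mine; the constants mirror CPython's opcode.h):

    #define HAVE_ARGUMENT 90   /* opcodes >= this take an argument */
    #define EXTENDED_ARG  143
    #define HAS_ARG(op)   ((op) >= HAVE_ARGUMENT)

    /* Fetch one instruction; EXTENDED_ARG supplies the high 16 bits of
       the *next* instruction's argument, so it must be folded in before
       we can dispatch. */
    static int fetch(const unsigned char **pcp, int *oparg)
    {
        const unsigned char *pc = *pcp;
        int opcode = *pc++;

        *oparg = 0;
        if (HAS_ARG(opcode)) {
            *oparg = pc[0] | (pc[1] << 8);
            pc += 2;
        }
        if (opcode == EXTENDED_ARG) {
            opcode = *pc++;
            *oparg = (*oparg << 16) | pc[0] | (pc[1] << 8);
            pc += 2;
        }
        *pcp = pc;
        return opcode;
    }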

Moreover, tagging pointers with 1 helps a lot for numerical benchmarks, but 
it may cause a slowdown for other kinds of operations.  Do you have 
non-numerical benchmarks? (Though I know that it's hard to get a fair 
comparison, because the Python object model is complex and it's not easy to 
write a subset of it in a way that is not cheating.)
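
To spell out where a slowdown could come from, here is a sketch of the 
scheme as I understand it (all names are my own invention): integer addition 
stays cheap, but every heap-object access pays an extra untagging step, and 
attribute-heavy code does mostly the latter.

    #include <stdint.h>

    typedef intptr_t Value;          /* one tagged machine word */

    #define IS_INT(v)      (((v) & 1) == 0)   /* integers: tag 0 */
    #define INT_TO_VAL(n)  ((Value)(n) << 1)
    #define VAL_TO_INT(v)  ((v) >> 1)
    #define PTR_TO_VAL(p)  ((Value)(p) | 1)   /* pointers: tag 1 */
    #define VAL_TO_PTR(v)  ((void *)((v) & ~(Value)1))

    /* INT_TO_VAL(a) + INT_TO_VAL(b) == INT_TO_VAL(a + b), so integer
       addition needs no untagging.  But every object dereference must
       go through VAL_TO_PTR first -- a cost that purely numerical
       benchmarks never pay. */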

Finally, as Carl said, it would be nice to know what kind of subset it is. 
E.g. does it support exceptions, sys.settrace() and sys._getframe()?


> In fact, we are at least 50% faster on anything we can run, but also
> on this benchmark, with the usage of unboxed integers and tagged
> pointers (we tag pointers with 1 and integers with 0, like V8 does and
> SELF did, so you can add integers without untagging them):
> 
> def f():
>     y = 0
>     x = 10000000
>     while x:
>         x-=1
>         y+=3
>     return y - 29999873
> 
> And since we do overflow checking (by checking EFLAGS in assembly,
> even if the code could be improved even more, probably), I don't think
> a comparison on this is unfair in any way.

Is your subset large enough to handle e.g. pystone? What is the result?
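
Back to the overflow checking: for readers who haven't seen the trick, the 
EFLAGS check Paolo mentions can be done from C with a bit of GCC inline 
assembly, without writing the whole loop in assembler.  A sketch for x86-64 
(the function name is mine):

    #include <stdint.h>

    /* Add two tagged integers (low bit 0) and report overflow via the
       CPU's overflow flag, which the ADD instruction sets for free. */
    static inline intptr_t tagged_add(intptr_t a, intptr_t b,
                                      int *overflowed)
    {
        intptr_t res = a;
        unsigned char of;

        __asm__ ("add %2, %0\n\t"
                 "seto %1"            /* copy OF into a byte register */
                 : "+r" (res), "=q" (of)
                 : "r" (b)
                 : "cc");
        *overflowed = of;
        return res;   /* meaningful only if !*overflowed */
    }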


>> 1) in Python a lot of opcodes are quite complex and time-consuming,
> 
> That's wrong for a number of reasons - the most common opcodes are
> probably loads from the constant pool, and loads and stores to/from
> the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed
> the dispatch after the LOAD_FAST opcode.

If you run benchmarks like the one shown above, I agree with you.  If you 
consider real-world applications, unfortunately there is more than LOAD_CONST 
and LOAD_FAST: LOAD_ATTR, STORE_ATTR, CALL_FUNCTION, etc. are all much more 
time-consuming than LOAD_{FAST,CONST}.

> That's your problem - threading helps when you spend most of the time
> on dispatch, and efficient interpreters get to that point. 

The question is: is it possible for a full Python interpreter to be 
"efficient" as you define it?


>> 2) due to Python's semantics, it's not possible to just jump from one opcode
>> to the next, as we need to do a lot of bookkeeping, like remembering what
>> was the last line executed, etc.
> 
> No, you don't need that, and not even CPython does it. For exception
> handling, just _when an exception is thrown_, 
[cut]

Sorry, I made a typo: what needs to be remembered is the last *bytecode* 
executed, not the last line.  This is necessary to implement sys.settrace() 
properly.

I never mentioned exception handling, that was your (wrong :-)) guess.
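
To make the bookkeeping concrete, the loop has to look more or less like 
this sketch (the types and the trace hook are stand-ins, not the real 
CPython ones, but the f_lasti update is the same one ceval.c performs on 
every iteration):

    /* Stand-in types; the real ones live in CPython's frameobject.h
       and pystate.h. */
    typedef struct {
        int f_lasti;              /* index of last bytecode executed */
        /* ... locals, value stack, etc. ... */
    } Frame;

    typedef void (*TraceFunc)(Frame *);

    static void run(Frame *f, const unsigned char *first_instr,
                    TraceFunc tracefunc)
    {
        const unsigned char *pc = first_instr;

        for (;;) {
            /* This store happens on *every* opcode, so that a trace
               function installed via sys.settrace() always sees an
               up-to-date position in the frame. */
            f->f_lasti = (int)(pc - first_instr);
            if (tracefunc)
                tracefunc(f);

            switch (*pc++) {
            /* ... opcode cases ... */
            default:
                return;
            }
        }
    }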

> If the interpreter loop is able to overflow the Icache, that should be
> fought through __builtin_expect first, to give hint for jump
> prediction and lay out slowpaths out-of-line.

I think that Armin once tried to use __builtin_expect, but I don't remember 
the outcome.  Armin, what was it?
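
For reference, the usual way to apply it in an eval loop is something like 
this sketch (generic_add is a hypothetical slow-path helper; the tagging 
scheme is the one discussed above):

    #include <stdint.h>

    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    extern intptr_t generic_add(intptr_t a, intptr_t b);  /* hypothetical */

    static intptr_t binary_add(intptr_t a, intptr_t b)
    {
        /* Fast path: both operands are tagged ints (low bits 0).
           The hint lets GCC keep this path fall-through and hot, and
           lay the rarely-taken slow path out of line. */
        if (LIKELY(((a | b) & 1) == 0))
            return a + b;      /* overflow check omitted, see above */
        return generic_add(a, b);
    }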


> Well, my plan is first to try, at some point, to implant threading
> into the Python interpreter and benchmark the difference - it
> shouldn't take long but it has a low priority currently.

That would be cool; tell us when you have done it :-).


