[pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)

Paolo Giarrusso p.giarrusso at gmail.com
Sun Dec 21 07:06:33 CET 2008


Hi all,
after the completion of our student project, I have enough experience
to say something more.
We wrote an interpreter for a Python subset in C, and we were able to
make it much faster than the CPython interpreter (50%-60% faster).
That was due to the use of indirect threading, tagged pointers and
unboxed integers, and a real GC (a copying one, which is enough for
the benchmarks we currently have; a generational GC would be more
realistic, but we didn't manage to implement one).

While we lack some runtime checks, conditional branches are often
almost free if you have to do an indirect branch soon and will likely
flush the pipeline anyway, especially if just one of the possibilities
lies on the fast path. That means that runtime checks for invalid code
cost about as much as a normal instruction, not as much as a
conditional branch (which can be much more expensive when
mispredicted).
I obviously do not expect everybody to believe me on this. So, if you
prefer, ignore this "vaporware" and just consider my proposals.

In fact, we are at least 50% faster on anything we can run, including
this benchmark, thanks to unboxed integers and tagged pointers (we tag
pointers with 1 and integers with 0, like V8 does and SELF did, so you
can add integers without untagging them):

def f():
    y = 0
    x = 10000000
    while x:
        x-=1
        y+=3
    return y - 29999873
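
To make the tagging scheme concrete, here is a minimal sketch in C
(illustrative names, not our actual code); since the integer tag is 0,
two tagged integers can be added without untagging:

    #include <stdint.h>
    #include <assert.h>

    typedef intptr_t value;                  /* one machine word */

    #define IS_INT(v)      (((v) & 1) == 0)  /* low bit 0 = unboxed integer */
    #define INT_TO_VAL(i)  ((value)(i) << 1)
    #define VAL_TO_INT(v)  ((v) >> 1)
    #define PTR_TO_VAL(p)  ((value)(p) | 1)  /* low bit 1 = heap pointer */
    #define VAL_TO_PTR(v)  ((void *)((v) & ~(value)1))

    int main(void)
    {
        value a = INT_TO_VAL(40), b = INT_TO_VAL(2);
        value sum = a + b;        /* tag stays 0, no untag/retag needed */
        assert(IS_INT(sum) && VAL_TO_INT(sum) == 42);
        return 0;
    }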

And since we do overflow checking (by checking EFLAGS in assembly,
even though that code could probably be improved further), I don't
think a comparison on this benchmark is unfair in any way.
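
The idea is along these lines - a minimal sketch for GCC on x86, with
made-up names; our real code is organized differently:

    /* add two untagged machine integers; report signed overflow by
       reading the overflow flag (OF in EFLAGS) right after the add */
    static int add_overflows(long a, long b, long *result)
    {
        char overflow;
        __asm__("add %3, %0\n\t"
                "seto %1"
                : "=r"(a), "=q"(overflow)
                : "0"(a), "r"(b)
                : "cc");
        *result = a;
        return overflow;
    }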

Also, when adding overflow checking we didn't notice any significant
slowdown. I think that's enough to say that we are faster also because
of design choices and implementation quality; the V8 lead developer,
Lars Bak, our professor, had a similar impression.

On Mon, Nov 17, 2008 at 15:05, Antonio Cuni <anto.cuni at gmail.com> wrote:
> Paolo Giarrusso wrote:

>> specialized bytecode can be significant, I guess, only if the
>> interpreter is really fast (either a threaded one, or a code-copying
>> one). Is the PyPy interpreter threaded?

> sometime ago I tried to measure if/how much we can gain with a threaded
> interpreter. I manually modified the produced C code to make the main loop
> threaded, but we didn't gain anything; I think there are three possible
> reasons:

> 1) in Python a lot of opcodes are quite complex and time-consuming,

That's wrong for a number of reasons - the most common opcodes are
probably loads from the constant pool, and loads and stores to/from
the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed
the dispatch after the LOAD_FAST opcode.

And those opcodes are quite simple. The hotspot used to be LOAD_CONST,
but to optimize it, I just had to save the pointer to the constant
pool. Instead of accessing current_function->co_consts each time, I
save it into a local variable, which can be kept in a register even
across function calls:

        raw_array *current_function_consts = current_function->co_consts;

That alone gave a 4% speedup on our benchmarks, and we didn't even
write a microbenchmark for LOAD_CONST. This is the handling code:

                LABEL_BEGIN(LOAD_CONST);
                PUSH(raw_array_skip_header(current_function_consts)[parameter]);
                LABEL_END(LOAD_CONST);

(LABEL_BEGIN/LABEL_END bracket all opcode cases; they expand to labels
for indirect threading and to the dispatch code itself.)

After seeing this, you can see why one extra memory load makes a difference.
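
For readers unfamiliar with the technique, here is a self-contained
sketch (made-up opcodes and names, not our actual macros) of indirect
threading with GCC's computed gotos: each handler ends by fetching the
next opcode and jumping to its label through a dispatch table.

    #include <stdio.h>

    enum { OP_LOAD_CONST, OP_ADD, OP_RETURN };

    #define LABEL_BEGIN(op) lbl_##op:
    #define LABEL_END(op)   opcode = *pc++; parameter = *pc++; \
                            goto *dispatch[opcode];

    static long interp(const unsigned char *pc, const long *consts)
    {
        /* GCC's labels-as-values: label addresses fill a static table */
        static void *dispatch[] =
            { &&lbl_OP_LOAD_CONST, &&lbl_OP_ADD, &&lbl_OP_RETURN };
        long stack[16], *sp = stack;
        unsigned char opcode, parameter;

        LABEL_END(start)            /* prime the first dispatch */

        LABEL_BEGIN(OP_LOAD_CONST)
            *sp++ = consts[parameter];
        LABEL_END(OP_LOAD_CONST)

        LABEL_BEGIN(OP_ADD)
            sp--; sp[-1] += sp[0];
        LABEL_END(OP_ADD)

        LABEL_BEGIN(OP_RETURN)
            return sp[-1];
    }

    int main(void)
    {
        const long consts[] = { 40, 2 };
        const unsigned char code[] = { OP_LOAD_CONST, 0, OP_LOAD_CONST, 1,
                                       OP_ADD, 0, OP_RETURN, 0 };
        printf("%ld\n", interp(code, consts));   /* prints 42 */
        return 0;
    }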

> so the
> time spent to dispatch to them is a little percentage of the total time
> spent for the execution

That's your problem - threading helps when you spend most of the time
on dispatch, and efficient interpreters get to that point. Threading
helps even Prolog interpreters, like Yap, so it can help for Python
too. Also, according to the measurements in "The Structure and
Performance of Efficient Interpreters", efficient interpreters tend to
execute no more than 30 instructions per opcode on their RISC machine
simulator. The Python interpreter could maybe still be considered
efficient, since it uses ~30 instructions per opcode, but that's on a
CISC processor (i.e. my laptop), so the numbers are not directly
comparable. I may measure the ratio again on my laptop for the
interpreters they benchmark, to make a fair comparison.

> 2) due to Python's semantics, it's not possible to just jump from one opcode
> to the next, as we need to do a lot of bookkeeping, like remembering what
> was the last line executed, etc.

No, you don't need that, and not even CPython does it. For exception
handling, just _when an exception is thrown_, you can take the current
opcode index and find the last executed line with a lookup table.
Their source code has a slow path for tracing support, but that's just
one conditional jump to a slow path, which is free in any case if the
branch predictor guesses it right (and it should).
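
A sketch of such a lookup table (made-up names; similar in spirit to
CPython's co_lnotab, but not its actual format):

    /* one entry per source line; entries are sorted by the index of the
       first opcode generated for that line */
    struct line_entry { int first_opcode_index; int line; };

    /* called only on the exception slow path */
    static int line_for_opcode(const struct line_entry *table, int n,
                               int opcode_index)
    {
        int line = table[0].line;
        for (int i = 0;
             i < n && table[i].first_opcode_index <= opcode_index; i++)
            line = table[i].line;
        return line;
    }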

So, can you give some more examples?

> This means that the trampolines at the end
> of each opcode contains a lot code duplication, leading to a bigger main
> loop, with possibly bad effects on the cache (didn't measure this, though)

If the interpreter loop is able to overflow the I-cache, that should
first be fought with __builtin_expect, to give the compiler branch
hints and to lay the slow paths out of line.
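
Something along these lines - the macro is the usual idiom, while the
tracing flag and helper are made up for illustration:

    /* hint that tracing is the cold path, so the compiler lays it out
       of line and keeps the hot dispatch loop compact in the I-cache */
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    static int tracing_enabled;                 /* illustrative flag */

    static void trace_current_opcode(void) { }  /* hypothetical slow path */

    static void dispatch_step(void)
    {
        if (UNLIKELY(tracing_enabled))
            trace_current_opcode();
        /* ... fetch and dispatch the next opcode ... */
    }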

> 3) it's possible that I did something wrong, so in that case my measurements
> are completely useless :-).  If anyone wants to try again, it cannot hurt.

Well, my plan is first to try, at some point, to add threaded dispatch
to the Python interpreter and benchmark the difference - it shouldn't
take long, but it currently has a low priority.

Regards
-- 
Paolo Giarrusso


