[pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)

Paolo Giarrusso p.giarrusso at gmail.com
Sun Dec 21 19:41:54 CET 2008


On Sun, Dec 21, 2008 at 19:13, Carl Friedrich Bolz <cfbolz at gmx.de> wrote:
> Hi Paolo,
>
> Paolo Giarrusso wrote:
>> after the completion of our student project, I have enough experience
>> to say something more.
>> We wrote in C an interpreter for a Python subset and we could make it
>> much faster than the Python interpreter (50%-60% faster). That was due
>> to the usage of indirect threading, tagged pointers and unboxed
>> integers, and a real GC (a copying one, which is enough for the
>> current benchmarks we have - a generational GC would be more realistic
>> but we didn't manage to do it).
>
> Interesting, but it sounds like you are comparing apples to oranges.
> What sort of subset of Python are you implementing, i.e. what things
> don't work? It has been shown time and time again that implementing only
> a subset of Python makes it possible to get interesting speedups
> compared to CPython. Then, as more and more features are implemented,
> the difference gets smaller and smaller. This was true for a number of
> Python implementations (e.g. IronPython).

Yes, I remember - that's part of the motivation of your project, or
something like that. Still, I feel there is nothing left to add for
the arithmetic-loop example; I forgot to note that, beyond overflow
checking, we also support operator overloading. This is our
BINARY_ADD implementation:

LABEL_BEGIN(BINARY_ADD);
{
    tagged_el_t a = POP;
    tagged_el_t b = POP;
    /* Fast path: both operands are unboxed, tagged integers. */
    if (!(unlikely(is_pointer(a)) || unlikely(is_pointer(b)))) {
        long sum;
        if (likely(!overflow_checked_add(&sum, TAGGED_EL_RAW_VALUE(a),
                                         TAGGED_EL_RAW_VALUE(b)))) {
            PUSH(raw_long_to_tagged_el(sum));
        } else {
            // XXX box to unlimited precision integers.
            PUSH(long_to_tagged_el(INT_MAX));
        }
    } else {
        /* Slow path: dispatch through the left operand's type. */
        Object *bp = get_ptr(b);
        Object *(*add)(Object *, Object *) = bp->type->add;
        if (add != NULL) {
            PUSH(ptr_to_tagged_el(add(get_ptr(b), get_ptr(a))));
        } else {
            assert(false); /* no __radd__ fallback yet */
        }
    }
}
LABEL_END(BINARY_ADD);

As you can see, the slow paths still need a lot of implementation work
(we don't support __radd__, for instance), but I tried to ensure that
nothing is missing from the fast path I'm benchmarking. That's why
overflow checking is implemented even though overflow itself is not
yet handled properly (and the benchmark in question does _not_ overflow).
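
To be concrete about the contract, here is a minimal portable sketch
of an overflow-checked add. This is only an illustration of the
interface overflow_checked_add satisfies, not our literal
implementation; the real one must additionally ensure the sum still
fits in the value range left over after tagging:

    #include <limits.h>

    /* Returns nonzero if a + b would overflow a long; otherwise
     * stores the sum in *sum and returns 0. */
    static inline int
    overflow_checked_add(long *sum, long a, long b)
    {
        /* Overflow can only happen when the addition crosses
         * LONG_MAX (b > 0) or LONG_MIN (b <= 0). */
        if (b > 0 ? a > LONG_MAX - b : a < LONG_MIN - b)
            return 1;
        *sum = a + b;
        return 0;
    }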

> I think to get really meaningful comparisons it would be good to modify
> an existing Python implementation and compare that. Yes, I know this can
> be a lot of work.

Yes, I'm aware of that, and that's why I plan to test threaded
dispatch in Python itself.

> On your actual techniques used I don't have an opinion. I am rather sure
> that a copying GC helped performance – it definitely did for PyPy.
> Tagged pointers make PyPy slower, but then, we tag integers with 1, not
> with 0. This could be changed, wouldn't even be too much work.

The major problem in changing it is probably converting pointer
accesses, since in your current representation pointers can be used
directly (it depends on your macros). But since offsetting a pointer
by a constant is directly supported by the processor's addressing
modes, it shouldn't have any major cost (except one extra displacement
byte in the instruction encoding, I guess, for each pointer access).
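
To make that concrete, here is a sketch of the scheme I have in mind.
The helper names mirror the ones in my BINARY_ADD above, but their
exact definitions here are my assumption, not PyPy's code or our
literal macros:

    #include <stdint.h>

    typedef uintptr_t tagged_el_t;

    /* Integers carry tag 0 in the low bit; pointers carry tag 1. */
    static inline int is_pointer(tagged_el_t v) { return v & 1; }

    /* An unboxed integer is stored shifted left by one bit; untagging
     * is a right shift (assumed arithmetic for signed longs, as with
     * GCC). */
    #define TAGGED_EL_RAW_VALUE(v) ((long)(v) >> 1)

    static inline tagged_el_t raw_long_to_tagged_el(long n)
    {
        return (uintptr_t)n << 1;  /* assumes n fits in the tagged range */
    }

    /* Untagging a pointer is a constant -1 displacement, which the
     * CPU folds into the addressing mode of the following load:
     * bp->type stays a single mov, just with a one-byte-larger
     * displacement. */
    static inline void *get_ptr(tagged_el_t v)
    {
        return (void *)(v - 1);
    }

    static inline tagged_el_t ptr_to_tagged_el(void *p)
    {
        return (uintptr_t)p | 1;
    }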

> About better implementations of the bytecode dispatch I am unsure. Note
> however, that a while ago we did measurements to see how large the
> bytecode dispatch overhead is. I don't recall the exact number, but I
> think it was below 10%. That means that even if you somehow manage to
> reduce that to no overhead at all, you would still only get 10%
> performance win.

Sure - but my point, discussed in the mail and taken from the paper I
mentioned, is that in an efficient interpreter, bytecode dispatch
becomes much more important than the other activities. Don't take my
word for it, take the paper's (it is quite well respected): they
report speedups from threaded dispatch of up to 2x :-) on efficient
interpreters.
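
For readers who haven't seen the technique, this is the kind of
dispatch I mean. The expansion below is an illustrative assumption
about what macros like our LABEL_BEGIN/LABEL_END can expand to using
GCC's (non-standard) labels-as-values extension, not our literal code:

    /* Each opcode body ends with its own indirect jump, so the branch
     * predictor gets one site per opcode instead of the single shared
     * (and thus poorly predicted) branch of a switch-based loop. */

    enum { OP_BINARY_ADD, OP_RETURN /* ... */ };

    #define LABEL_BEGIN(op) label_##op:
    #define LABEL_END(op)   goto *dispatch[*pc++]

    static void interp(const unsigned char *pc)
    {
        static const void *dispatch[] = {
            [OP_BINARY_ADD] = &&label_OP_BINARY_ADD,
            [OP_RETURN]     = &&label_OP_RETURN,
        };

        goto *dispatch[*pc++];  /* prime the first dispatch */

        LABEL_BEGIN(OP_BINARY_ADD);
        {
            /* ... opcode body, as in BINARY_ADD above ... */
        }
        LABEL_END(OP_BINARY_ADD);

        LABEL_BEGIN(OP_RETURN);
        return;
    }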

Also, Antonio Cuni gave just one example of additional overhead in
dispatch: source-line tracking, for exception handling I guess. And as
I explained, as far as I understand that's simply a mistake, even
compared to CPython - so I think there may be other low-hanging fruit
in the interpreter. But I'm still waiting for answers and further
explanations on this point.

Regards
-- 
Paolo Giarrusso

