On Sun, Dec 21, 2008 at 19:13, Carl Friedrich Bolz <cfbolz@gmx.de> wrote:
Hi Paolo,
Paolo Giarrusso wrote:
after the completion of our student project, I have enough experience to say something more. We wrote an interpreter in C for a subset of Python, and we could make it much faster than the Python interpreter (50%-60% faster). That was due to the use of indirect threading, tagged pointers with unboxed integers, and a real GC (a copying collector, which is enough for the benchmarks we have so far; a generational GC would be more realistic, but we didn't manage to implement one).
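To give an idea of what "a copying one" means, here is a minimal Cheney-style sketch; the names and the two-field cell layout are made up for illustration, our real collector differs:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal Cheney-style copying collector sketch; all names hypothetical.
 * Objects are fixed-size cells with two fields that may point to other cells. */
#define HEAP_CELLS 1024

typedef struct Cell {
    struct Cell *fields[2];   /* child pointers (or NULL) */
    struct Cell *forward;     /* forwarding pointer, set during GC */
} Cell;

static Cell space_a[HEAP_CELLS], space_b[HEAP_CELLS];
static Cell *from_space = space_a, *to_space = space_b;
static size_t alloc_top = 0;

/* Move one object to to-space (or return its forwarding pointer). */
static Cell *copy(Cell *obj, size_t *free_top) {
    if (obj == NULL) return NULL;
    if (obj->forward) return obj->forward;        /* already moved */
    Cell *new_obj = &to_space[(*free_top)++];
    *new_obj = *obj;
    new_obj->forward = NULL;
    obj->forward = new_obj;
    return new_obj;
}

/* Collect, treating *root as the single root; returns live cell count. */
static size_t gc(Cell **root) {
    size_t scan = 0, free_top = 0;
    *root = copy(*root, &free_top);
    while (scan < free_top) {                     /* Cheney scan loop */
        Cell *c = &to_space[scan++];
        c->fields[0] = copy(c->fields[0], &free_top);
        c->fields[1] = copy(c->fields[1], &free_top);
    }
    Cell *tmp = from_space;                       /* flip semispaces */
    from_space = to_space;
    to_space = tmp;
    alloc_top = free_top;
    return free_top;
}

static Cell *alloc_cell(void) {
    assert(alloc_top < HEAP_CELLS);               /* real code would GC here */
    Cell *c = &from_space[alloc_top++];
    c->fields[0] = c->fields[1] = NULL;
    c->forward = NULL;
    return c;
}
```

Allocation is just a pointer bump, and collection cost is proportional to the live data only, which is why it beats naive malloc/free for interpreter workloads.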
Interesting, but it sounds like you are comparing apples to oranges. What sort of subset of Python are you implementing, i.e. what things don't work? It has been shown time and time again that implementing only a subset of Python makes it possible to get interesting speedups compared to CPython. Then, as more and more features are implemented, the difference gets smaller and smaller. This was true for a number of Python implementations (e.g. IronPython).
Yes, I remember - that's part of the motivation for your project, or something like that. Still, for the arithmetic loop example I don't think there is anything left to add; I forgot to note that, beyond overflow checking, we also support operator overloading. This is our BINARY_ADD implementation:

    LABEL_BEGIN(BINARY_ADD);
    {
        tagged_el_t a = POP;
        tagged_el_t b = POP;
        if (!(unlikely(is_pointer(a)) || unlikely(is_pointer(b)))) {
            long sum;
            if (likely(!overflow_checked_add(&sum, TAGGED_EL_RAW_VALUE(a),
                                             TAGGED_EL_RAW_VALUE(b)))) {
                PUSH(raw_long_to_tagged_el(sum));
            } else {
                //XXX box to unlimited precision integers.
                PUSH(long_to_tagged_el(INT_MAX));
            }
        } else {
            Object *bp = get_ptr(b);
            Object *(*add)(Object *, Object *) = bp->type->add;
            if (add != NULL) {
                PUSH(ptr_to_tagged_el(add(get_ptr(b), get_ptr(a))));
            } else {
                assert(false);
            }
        }
    }
    LABEL_END(BINARY_ADD);

As you can see, the slow paths still need a lot of implementation work (we don't support __radd__, for instance), but I tried to ensure that nothing is missing from the fast path I'm benchmarking. That's why overflow checking is implemented even though overflow itself is not handled properly yet (and the given example does _not_ overflow).
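For reference, overflow_checked_add is just the usual pre-checked signed addition; a portable sketch follows (the actual helper in our interpreter may be implemented differently, but the contract is the same):

```c
#include <limits.h>
#include <stdbool.h>

/* Sketch of overflow-checked signed addition; returns true on overflow.
 * The real overflow_checked_add may differ (e.g. inline asm testing the
 * CPU overflow flag), but the observable behavior matches this. */
static bool overflow_checked_add(long *sum, long a, long b) {
    if ((b > 0 && a > LONG_MAX - b) || (b < 0 && a < LONG_MIN - b))
        return true;   /* would overflow; *sum is left untouched */
    *sum = a + b;
    return false;
}
```

Checking before adding avoids relying on signed wraparound, which is undefined behavior in C.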
I think to get really meaningful comparisons it would be good to modify an existing Python implementation and compare that. Yes, I know this can be a lot of work.
Yes, I'm aware of that, and that's why I plan to test threaded dispatch in the Python interpreter itself.
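For context, the threaded dispatch we use looks roughly like the following toy example (the instruction set and opcode names are made up; the &&label computed-goto syntax is a GCC extension):

```c
#include <stdio.h>

/* Toy indirect-threaded interpreter (GCC computed goto).
 * The instruction set and opcode names are made up for illustration. */
enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

static long run(const unsigned char *pc) {
    /* One label address per opcode; each handler dispatches the next
     * opcode itself instead of jumping back to a central switch. */
    static void *table[] = { &&op_push1, &&op_add, &&op_print, &&op_halt };
    long stack[16], *sp = stack;

#define DISPATCH() goto *table[*pc++]
    DISPATCH();

op_push1: *sp++ = 1;               DISPATCH();
op_add:   --sp; sp[-1] += *sp;     DISPATCH();
op_print: printf("%ld\n", sp[-1]); DISPATCH();
op_halt:  return sp[-1];
#undef DISPATCH
}
```

With a switch-based loop every iteration funnels through one indirect branch; here each handler gets its own, which gives the branch predictor per-opcode history - that is where the speedups reported in the paper come from.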
On your actual techniques used I don't have an opinion. I am rather sure that a copying GC helped performance - it definitely did for PyPy. Tagged pointers make PyPy slower, but then, we tag integers with 1, not with 0. This could be changed; it wouldn't even be too much work.
The major problem in changing it might be converting pointers, since in your representation pointers can be used directly (it depends on your macros). But since offsetting a pointer is directly supported by the processor's addressing modes, it shouldn't have any major cost (except one extra byte of code, I guess, for each pointer access).
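To make the trade-off concrete, here is a sketch of the two schemes (all helper names are made up): scheme A tags integers with 1, so aligned pointers are used directly; scheme B tags integers with 0, so pointers must be stored with the tag bit set and untagged with a -1 adjustment on each access:

```c
#include <stdint.h>

/* Sketch of two pointer-tagging schemes; all helper names are made up.
 * Scheme A (PyPy-style): integers carry tag 1; aligned pointers have a
 *   zero low bit and are used directly.
 * Scheme B: integers carry tag 0; pointers are stored with the low bit
 *   set and need a -1 adjustment on every access. */
typedef uintptr_t tagged_t;

/* Scheme A */
static tagged_t a_from_int(intptr_t i) { return ((uintptr_t)i << 1) | 1; }
static intptr_t a_to_int(tagged_t t)   { return (intptr_t)t >> 1; }
static tagged_t a_from_ptr(void *p)    { return (uintptr_t)p; }
static void    *a_to_ptr(tagged_t t)   { return (void *)t; }       /* direct */

/* Scheme B */
static tagged_t b_from_int(intptr_t i) { return (uintptr_t)i << 1; }
static intptr_t b_to_int(tagged_t t)   { return (intptr_t)t >> 1; }
static tagged_t b_from_ptr(void *p)    { return (uintptr_t)p | 1; }
static void    *b_to_ptr(tagged_t t)   { return (void *)(t - 1); } /* -1 each use */
```

The -1 in b_to_ptr folds into the load's addressing mode (e.g. a -1 displacement on x86), which is exactly the "one extra byte per access" cost mentioned above.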
About better implementations of the bytecode dispatch I am unsure. Note, however, that a while ago we did measurements to see how large the bytecode dispatch overhead is. I don't recall the exact number, but I think it was below 10%. That means that even if you somehow managed to reduce that to no overhead at all, you would still only get a 10% performance win.
Sure - but my point, discussed in the mail and taken from the paper I mentioned, is that in an efficient interpreter, bytecode dispatch becomes much more important than the other activities. Don't take my word for it, take the paper's (which is quite respected): they report speedups from threading of up to 2x :-) on efficient interpreters. Also, Antonio Cuni gave just one example of additional dispatch overhead: source-line tracking, for exception handling I guess. And as I explained, as far as I understand that's simply a mistake, even compared to CPython - so I think there may be other low-hanging fruit in the interpreter. But I'm still waiting for answers and further explanations on this point.

Regards
--
Paolo Giarrusso