[pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)

Fri Dec 26 05:52:06 CET 2008

Hi!
This time, I'm trying to answer shortly

Is this the geninterp you're talking about?
http://codespeak.net/pypy/dist/pypy/doc/geninterp.html
Is the geninterpreted version RPython code? I'm almost sure, except
for the """NOT_RPYTHON""" doc string in the geninterpreted source
snippet. I guess it's there because the _source_ of it is not RPython
code.

On Wed, Dec 24, 2008 at 22:29, Antonio Cuni <anto.cuni at gmail.com> wrote:
> Paolo Giarrusso wrote:

> I quickly counted the number of lines for the interpreters, excluding the
> builtin types/functions, and we have 28188 non-empty lines for python, 5376
> for prolog and 1707 for scheme.

> I know that the number of lines does not mean anything, but I think it's a
> good hint about the relative complexities of the languages.

Also about the amount of Python-specific optimizations you did :-).

>  I also know
> that being more complex does not necessarily mean that it's impossible to
> write an "efficient" interpreter for it, it's an open question.

The 90-10 rule should apply anyway, but overhead for obscure features
might be a problem.
Well, reflection on the call stack can have a big runtime impact, but
that's also present in Smalltalk as far as I know and that can be
handled as well.
Anyway, if Python developers are not able to implement efficient
multithreading in the interpreter because of the excessive performance
impact and they don't decide to drop refcounting, saying "there's
space for optimizations" looks like a safe bet; the source of the idea
is what I've been taught in the course, but I'm also noticing this by
myself.

> Thanks for the interesting email, but unfortunately I don't have time to
> answer right now (xmas is coming :-)), I just drop few quick notes:

Yeah, for me as well, plus I'm in the last month of my Erasmus study time :-)

>> Ok, just done it, the speedup given by indirect threading seems to be
>> about 18% (see also above). More proper benchmarks are needed though.

> that's interesting, thanks for having tried. I wonder I should try again
> with indirect threading in pypy soon or later.

I would do it together with OProfile benchmarking of indirect branches
and of their mispredictions (see the presentation for the OProfile
commands on the Core 2 processor).

> Btw, are the sources for your project available somewhere?

They'll be sooner or later. There are a few bugs I should fix, and a
few other interesting things to do.
But if you are interested in trying to do benchmarking even if it's a
student project, it's not feature complete, and it's likely buggy, I
might publish it earlier.

>> And as you say in the other mail, the overhead given by dispatch is
>> quite more than 50% (maybe).

> no, it's less.

Yeah, sorry, I remember you wrote geninterp also does other stuff.

> 50% is the total speedup given by geninterp, which removes
> dispatch overhead but also other things, like storing variables on the stack

I wonder why that's not done by your stock interpreter - the CPython
frame object has a pointer to a real stack frame; I'm not sure, but I
guess this can increase stack locality since a 32/64-byte cacheline is
much bigger than a typical stack frame and has space for the operand
stack (and needless to say we store locals on the stack, like JVMs
do).

The right benchmark for this, I guess, would be oprofiling cache
misses on a recursion test like factorial or Fibonacci.

> and turning python level flow control into C-level flow control (so e.g.
> loops are expressed as C loops).

Looking at the geninterpreted code, it's amazing that the RPython
translator can do this. Can it also already specialize the interpreter
for each of the object spaces and save the virtual calls?

== About F_LASTI ==
> by "tracking the last bytecode executed" I was really referring to the
> equivalent of f_lasti; are you sure you can store it in a local and still
> implement sys.settrace()?

Not really, I didn't even start studying its proper semantics, but now
I know it's worth a try and some additional complexity, at least in an
interpreter with GC. If one write to memory has such a horrible
impact, I'm frightened by the possible impact of refcounting; on the
other side, I wouldn't be surprised if saving the f_lasti write had no
impact on CPython.

My current approach would be that if I can identify code paths where
no code can even look at it (and I guess that most simple opcodes are
such paths), I can copy f_lasti to a global structure only in the
other paths; if f_lasti is just passed to the code tracing routine and
it's called only from the interpreter loop, I could even turn it into
a parameter to that routine (it may be faster with a register calling
convention, but anyway IMHO one gets code which is easier to follow).

Actually, I even wonder if I can just set it when tracing is active,
but since that'd be trivial to do, I guess that when you return from a
call to settrace, you discover (without being able to anticipate it)
that now you need to discover the previous opcode, that's why it's not
already fixed. Still, a local can do even for that, or more
complicated algorithms can do as well (basically, the predecessor is
always known at compile time except for jumps, so only jump opcodes
really need to compute f_lasti).

Regards
-- 
Paolo Giarrusso