[pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)

Jim Baker jbaker at zyasoft.com
Fri Dec 26 16:19:36 CET 2008

Interesting discussion. Just a note:

in Jython, f_lasti is only used to manage exit/entry points, specifically
for coroutines/generators, so this is not at the level of bytecode
granularity. We also set f_lineno, at the level of Python code of course.
HotSpot apparently optimizes this access nicely in any event. (There are
other problems with supporting call frames, but this is not one of them it

Java also offers a debugging interface, which in conjunction with a C++
agent, allows for more fine-grained access to these internals, potentially
with lower overhead. This is something Tobias Ivarsson has been exploring.

- Jim

On Thu, Dec 25, 2008 at 9:52 PM, Paolo Giarrusso <p.giarrusso at gmail.com>wrote:

> Hi!
> This time, I'm trying to answer shortly
> Is this the geninterp you're talking about?
> http://codespeak.net/pypy/dist/pypy/doc/geninterp.html
> Is the geninterpreted version RPython code? I'm almost sure, except
> for the """NOT_RPYTHON""" doc string in the geninterpreted source
> snippet. I guess it's there because the _source_ of it is not RPython
> code.
> On Wed, Dec 24, 2008 at 22:29, Antonio Cuni <anto.cuni at gmail.com> wrote:
> > Paolo Giarrusso wrote:
> > I quickly counted the number of lines for the interpreters, excluding the
> > builtin types/functions, and we have 28188 non-empty lines for python,
> 5376
> > for prolog and 1707 for scheme.
> > I know that the number of lines does not mean anything, but I think it's
> a
> > good hint about the relative complexities of the languages.
> Also about the amount of Python-specific optimizations you did :-).
> >  I also know
> > that being more complex does not necessarily mean that it's impossible to
> > write an "efficient" interpreter for it, it's an open question.
> The 90-10 rule should apply anyway, but overhead for obscure features
> might be a problem.
> Well, reflection on the call stack can have a big runtime impact, but
> that's also present in Smalltalk as far as I know and that can be
> handled as well.
> Anyway, if Python developers are not able to implement efficient
> multithreading in the interpreter because of the excessive performance
> impact and they don't decide to drop refcounting, saying "there's
> space for optimizations" looks like a safe bet; the source of the idea
> is what I've been taught in the course, but I'm also noticing this by
> myself.
> > Thanks for the interesting email, but unfortunately I don't have time to
> > answer right now (xmas is coming :-)), I just drop few quick notes:
> Yeah, for me as well, plus I'm in the last month of my Erasmus study time
> :-)
> >> Ok, just done it, the speedup given by indirect threading seems to be
> >> about 18% (see also above). More proper benchmarks are needed though.
> > that's interesting, thanks for having tried. I wonder I should try again
> > with indirect threading in pypy soon or later.
> I would do it together with OProfile benchmarking of indirect branches
> and of their mispredictions (see the presentation for the OProfile
> commands on the Core 2 processor).
> > Btw, are the sources for your project available somewhere?
> They'll be sooner or later. There are a few bugs I should fix, and a
> few other interesting things to do.
> But if you are interested in trying to do benchmarking even if it's a
> student project, it's not feature complete, and it's likely buggy, I
> might publish it earlier.
> >> And as you say in the other mail, the overhead given by dispatch is
> >> quite more than 50% (maybe).
> > no, it's less.
> Yeah, sorry, I remember you wrote geninterp also does other stuff.
> > 50% is the total speedup given by geninterp, which removes
> > dispatch overhead but also other things, like storing variables on the
> stack
> I wonder why that's not done by your stock interpreter - the CPython
> frame object has a pointer to a real stack frame; I'm not sure, but I
> guess this can increase stack locality since a 32/64-byte cacheline is
> much bigger than a typical stack frame and has space for the operand
> stack (and needless to say we store locals on the stack, like JVMs
> do).
> The right benchmark for this, I guess, would be oprofiling cache
> misses on a recursion test like factorial or Fibonacci.
> > and turning python level flow control into C-level flow control (so e.g.
> > loops are expressed as C loops).
> Looking at the geninterpreted code, it's amazing that the RPython
> translator can do this. Can it also already specialize the interpreter
> for each of the object spaces and save the virtual calls?
> == About F_LASTI ==
> > by "tracking the last bytecode executed" I was really referring to the
> > equivalent of f_lasti; are you sure you can store it in a local and still
> > implement sys.settrace()?
> Not really, I didn't even start studying its proper semantics, but now
> I know it's worth a try and some additional complexity, at least in an
> interpreter with GC. If one write to memory has such a horrible
> impact, I'm frightened by the possible impact of refcounting; on the
> other side, I wouldn't be surprised if saving the f_lasti write had no
> impact on CPython.
> My current approach would be that if I can identify code paths where
> no code can even look at it (and I guess that most simple opcodes are
> such paths), I can copy f_lasti to a global structure only in the
> other paths; if f_lasti is just passed to the code tracing routine and
> it's called only from the interpreter loop, I could even turn it into
> a parameter to that routine (it may be faster with a register calling
> convention, but anyway IMHO one gets code which is easier to follow).
> Actually, I even wonder if I can just set it when tracing is active,
> but since that'd be trivial to do, I guess that when you return from a
> call to settrace, you discover (without being able to anticipate it)
> that now you need to discover the previous opcode, that's why it's not
> already fixed. Still, a local can do even for that, or more
> complicated algorithms can do as well (basically, the predecessor is
> always known at compile time except for jumps, so only jump opcodes
> really need to compute f_lasti).
> Regards
> --
> Paolo Giarrusso
> _______________________________________________
> pypy-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/pypy-dev

Jim Baker
jbaker at zyasoft.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/pypy-dev/attachments/20081226/eee7b158/attachment.html>

More information about the Pypy-dev mailing list