[pypy-dev] Fwd: Threaded interpretation (was: Re: compiler optimizations: collecting ideas)

Jim Baker jbaker at zyasoft.com
Sat Dec 27 22:08:46 CET 2008


forgot to reply-all

---------- Forwarded message ----------
From: Jim Baker <jbaker at zyasoft.com>
Date: Sat, Dec 27, 2008 at 2:08 PM
Subject: Re: [pypy-dev] Threaded interpretation (was: Re: compiler
optimizations: collecting ideas)
To: Paolo Giarrusso <p.giarrusso at gmail.com>


I'm only speaking of Jython 2.5, since that's what we're working on, but I
believe it was the same for 2.2. (I personally regard 2.5 as more robust;
we certainly test it more extensively, although external interfaces may
still change before our forthcoming release.)

We have limited support for tracing in terms of its events interface. It
seems usable enough, although we don't produce quite the same traces.
Fidelity to things we can't readily and efficiently support is not a goal
for Jython, especially when they don't bear on running interesting
applications. So we don't support ref counting, a GIL, and certain other
internal details. We have found that when code does rely on these details,
we have been able to push changes into the appropriate projects. Often
the same changes are needed for alternative implementations like PyPy; we
saw this with Django support. However, we do support frame introspection,
the standard Python object model (including classic classes), and even the
rather mixed-up unicode/str model.
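The kind of frame introspection compatibility described here can be illustrated with a small sketch (the helper names are mine, chosen for illustration):

```python
import sys

def caller_name():
    # Walk one frame up the call stack. sys._getframe is supported by
    # both CPython and Jython, which is the sort of frame introspection
    # compatibility discussed above.
    frame = sys._getframe(1)
    return frame.f_code.co_name

def outer():
    return caller_name()

print(outer())
```

Code that relies on this (debuggers, test frameworks, templating engines) works unmodified on any implementation that supports real frame objects.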

Having said that, we do plan to support a Python bytecode (PBC) VM running
in Jython in a future release (possibly 2.5.1). At that point, we may
support f_lasti at the level of the PBC instruction, just like CPython.
Reasons for supporting PBC include scenarios where we can't dynamically
generate and then load Java bytecode (unsigned applets, for example, or
Android), greenlets (which really need f_lasti, although it's possible to
get a subset of greenlet functionality without it), and various components
like Jinja2, a templating engine, that directly emit PBC. It's also rather
cool to do, I think.
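What per-instruction f_lasti support buys can be seen from CPython itself; a minimal sketch using sys.settrace (the trace-function name is mine):

```python
import sys

observed = []

def tracer(frame, event, arg):
    # frame.f_lasti is the bytecode offset of the last instruction
    # executed; CPython keeps it current, which is what instruction-level
    # PBC support would make possible in other implementations as well.
    observed.append((event, frame.f_lasti))
    return tracer

def add(a, b):
    return a + b

sys.settrace(tracer)
add(1, 2)
sys.settrace(None)
```

Each traced event carries the current bytecode offset, which is exactly the state a greenlet-style switch also needs in order to resume a frame mid-function.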

- Jim


On Sat, Dec 27, 2008 at 12:58 PM, Paolo Giarrusso <p.giarrusso at gmail.com> wrote:

> On 26/12/2008, Jim Baker <jbaker at zyasoft.com> wrote:
> > Interesting discussion. Just a note:
> >
> > in Jython, f_lasti is only used to manage exit/entry points, specifically
> > for coroutines/generators, so this is not at the level of bytecode
> > granularity.
>
> But is it used to support sys.settrace()? Also, its past usage
> (before Python 2.3) for generators is mentioned in CPython source code
> comments, and the last stable Jython release is 2.2.1. What happens in
> 2.5-beta could be more interesting.
>
> Here's the comment from <src_tree>/Python/ceval.c:
>
>           f->f_lasti now refers to the index of the last instruction
>           executed.  You might think this was obvious from the name, but
>           this wasn't always true before 2.3!  PyFrame_New now sets
>           f->f_lasti to -1 (i.e. the index *before* the first instruction)
>           and YIELD_VALUE doesn't fiddle with f_lasti any more.  So this
>           does work.  Promise.
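The behaviour that comment describes can be observed from Python itself (shown here with modern syntax; exact f_lasti values vary across CPython versions, so this only checks that the offset advances between yields):

```python
def gen():
    yield 1
    yield 2

g = gen()
next(g)
first = g.gi_frame.f_lasti   # offset of the first YIELD instruction
next(g)
second = g.gi_frame.f_lasti  # offset of the second YIELD, further along
```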
>
> > We also set f_lineno, at the level of Python code of course.
>
> Hmm, CPython is already able to do this only when needed (i.e. when
> calling trace functions), "as of Python 2.3":
>
>    /* As of 2.3 f_lineno is only valid when tracing is active (i.e. when
>       f_trace is set) -- at other times use PyCode_Addr2Line instead. */
>    int f_lineno;               /* Current line number */
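The non-tracing path that comment points at, PyCode_Addr2Line, has a pure-Python analogue in the dis module; a sketch:

```python
import dis

def sample():
    x = 1
    y = 2
    return x + y

# dis.findlinestarts maps bytecode offsets to source line numbers.
# This is how a line number can be recovered from f_lasti on demand,
# without keeping f_lineno up to date on every instruction.
pairs = [(off, line) for off, line in dis.findlinestarts(sample.__code__)
         if line is not None]
```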
>
> > HotSpot apparently optimizes this access nicely in any event. (There are
> > other problems with supporting call frames, but this is not one of them,
> > it seems.)
>
> Do you mean you inspected the code generated by the HotSpot native
> compiler?
> Hmm, it can't save a write to memory if these are fields of the frame
> object, unless it can prove via escape analysis that a pointer to the
> object does not escape the local function (escape analysis was added
> in Java 1.6 but is said _somewhere_ to notice too few cases).
>
> > Java also offers a debugging interface which, in conjunction with a C++
> > agent, allows for more fine-grained access to these internals,
> > potentially with lower overhead. This is something Tobias Ivarsson has
> > been exploring.
>
> That sounds interesting, even if strange (and applicable to neither
> CPython nor PyPy) - do you want to offer an alternate debug interface
> or to implement settrace through this?
>
> > - Jim
>
> > On Thu, Dec 25, 2008 at 9:52 PM, Paolo Giarrusso <p.giarrusso at gmail.com>
> > wrote:
> > > Hi!
> > > This time, I'm trying to answer shortly
> > >
> > > Is this the geninterp you're talking about?
> > >
> > http://codespeak.net/pypy/dist/pypy/doc/geninterp.html
> > > Is the geninterpreted version RPython code? I'm almost sure, except
> > > for the """NOT_RPYTHON""" doc string in the geninterpreted source
> > > snippet. I guess it's there because the _source_ of it is not RPython
> > > code.
> > >
> > >
> > > On Wed, Dec 24, 2008 at 22:29, Antonio Cuni <anto.cuni at gmail.com>
> wrote:
> > > > Paolo Giarrusso wrote:
> > >
> > >
> > > > I quickly counted the number of lines for the interpreters, excluding
> > > > the builtin types/functions, and we have 28188 non-empty lines for
> > > > python, 5376 for prolog and 1707 for scheme.
> > >
> > > > I know that the number of lines does not mean anything, but I think
> > > > it's a good hint about the relative complexities of the languages.
> > >
> > > Also about the amount of Python-specific optimizations you did :-).
> > >
> > >
> > > > I also know that being more complex does not necessarily mean that
> > > > it's impossible to write an "efficient" interpreter for it; it's an
> > > > open question.
> > >
> > > The 90-10 rule should apply anyway, but overhead for obscure features
> > > might be a problem.
> > > Well, reflection on the call stack can have a big runtime impact, but
> > > that's also present in Smalltalk as far as I know, and it can be
> > > handled as well.
> > > Anyway, if Python developers are unable to implement efficient
> > > multithreading in the interpreter because of the excessive performance
> > > impact, and they won't decide to drop refcounting, then saying
> > > "there's room for optimizations" looks like a safe bet; the idea comes
> > > from what I was taught in the course, but I'm also noticing it myself.
> > >
> > >
> > > > Thanks for the interesting email, but unfortunately I don't have time
> > > > to answer right now (xmas is coming :-)), so I'll just drop a few
> > > > quick notes:
> > >
> > > Yeah, for me as well, plus I'm in the last month of my Erasmus study
> > > time :-)
> > >
> > >
> > > >> Ok, just done it, the speedup given by indirect threading seems to
> > > >> be about 18% (see also above). More proper benchmarks are needed,
> > > >> though.
> > >
> > > > that's interesting, thanks for having tried. I wonder if I should
> > > > try again with indirect threading in pypy sooner or later.
> > >
> > > I would do it together with OProfile benchmarking of indirect branches
> > > and of their mispredictions (see the presentation for the OProfile
> > > commands on the Core 2 processor).
> > >
> > >
> > > > Btw, are the sources for your project available somewhere?
> > >
> > > They'll be sooner or later. There are a few bugs I should fix, and a
> > > few other interesting things to do.
> > > But if you are interested in trying to do benchmarking even if it's a
> > > student project, it's not feature complete, and it's likely buggy, I
> > > might publish it earlier.
> > >
> > >
> > > >> And as you say in the other mail, the overhead given by dispatch is
> > > >> quite more than 50% (maybe).
> > >
> > > > no, it's less.
> > >
> > > Yeah, sorry, I remember you wrote geninterp also does other stuff.
> > >
> > >
> > > > 50% is the total speedup given by geninterp, which removes dispatch
> > > > overhead but also does other things, like storing variables on the
> > > > stack
> > >
> > > I wonder why that's not done by your stock interpreter - the CPython
> > > frame object has a pointer to a real stack frame; I'm not sure, but I
> > > guess this can increase stack locality since a 32/64-byte cacheline is
> > > much bigger than a typical stack frame and has space for the operand
> > > stack (and needless to say we store locals on the stack, like JVMs
> > > do).
> > >
> > > The right benchmark for this, I guess, would be oprofiling cache
> > > misses on a recursion test like factorial or Fibonacci.
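A minimal recursion microbenchmark of the kind suggested here (the function and parameters are arbitrary illustrative choices):

```python
import timeit

def fib(n):
    # Naive recursion: allocates a fresh frame per call, which is what
    # makes it a good probe for frame/stack cache locality.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

# Time many runs; under a hardware-event profiler one would instead
# count cache-miss events while this executes.
elapsed = timeit.timeit(lambda: fib(18), number=100)
```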
> > >
> > >
> > > > and turning Python-level flow control into C-level flow control (so,
> > > > e.g., loops are expressed as C loops).
> > >
> > > Looking at the geninterpreted code, it's amazing that the RPython
> > > translator can do this. Can it also already specialize the interpreter
> > > for each of the object spaces and save the virtual calls?
> > >
> > > == About F_LASTI ==
> > >
> > > > by "tracking the last bytecode executed" I was really referring to
> > > > the equivalent of f_lasti; are you sure you can store it in a local
> > > > and still implement sys.settrace()?
> > >
> > > Not really, I didn't even start studying its proper semantics, but now
> > > I know it's worth a try and some additional complexity, at least in an
> > > interpreter with GC. If one write to memory has such a horrible
> > > impact, I'm frightened by the possible impact of refcounting; on the
> > > other hand, I wouldn't be surprised if saving the f_lasti write had no
> > > impact on CPython.
> > >
> > > My current approach would be that if I can identify code paths where
> > > no code can even look at it (and I guess that most simple opcodes are
> > > such paths), I can copy f_lasti to a global structure only in the
> > > other paths; if f_lasti is just passed to the code tracing routine and
> > > it's called only from the interpreter loop, I could even turn it into
> > > a parameter to that routine (it may be faster with a register calling
> > > convention, but anyway IMHO one gets code which is easier to follow).
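The approach described here - keeping the index in a local on the fast path and spilling it to the frame only where a tracer can observe it - can be sketched with a toy interpreter (all opcodes and names invented for illustration):

```python
# Toy bytecode VM: lasti lives in a local variable and is copied into
# the frame object only at trace points, the only places it is observable.
PUSH, ADD, TRACE, HALT = range(4)

class Frame:
    def __init__(self):
        self.f_lasti = -1   # only guaranteed valid at trace points

def run(code, trace_hook=None):
    frame = Frame()
    stack = []
    pc = 0
    while True:
        lasti = pc              # cheap local write on the fast path
        op = code[pc]; pc += 1
        if op == PUSH:
            stack.append(code[pc]); pc += 1
        elif op == ADD:
            b = stack.pop(); stack[-1] += b
        elif op == TRACE:
            frame.f_lasti = lasti   # spill only where observable
            if trace_hook:
                trace_hook(frame)
        elif op == HALT:
            return stack[-1]

seen = []
result = run([PUSH, 2, PUSH, 3, ADD, TRACE, HALT],
             trace_hook=lambda f: seen.append(f.f_lasti))
```

The memory write happens once per trace point instead of once per instruction, which is the saving being discussed.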
> > >
> > > Actually, I even wonder if I can just set it when tracing is active,
> > > but since that'd be trivial to do, I guess the problem is that when
> > > you return from a call to settrace, you find out (without being able
> > > to anticipate it) that you now need to know the previous opcode;
> > > that's why it's not already done that way. Still, a local can suffice
> > > even for that, or more complicated schemes can work as well
> > > (basically, the predecessor is always known at compile time except
> > > for jumps, so only jump opcodes really need to compute f_lasti).
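The closing observation - that every offset's predecessor is statically known except at jump targets - can be checked with dis.findlabels, which lists exactly those offsets:

```python
import dis

def loop(n):
    total = 0
    for i in range(n):
        total += i
    return total

# findlabels returns the bytecode offsets that are jump targets; every
# other instruction's predecessor is fixed at compile time, so only
# these points would need a computed f_lasti.
targets = dis.findlabels(loop.__code__.co_code)
```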
> > >
> > > Regards
> > > --
> > > Paolo Giarrusso
> > >
> > >
> > >
>
>
> --
> Paolo Giarrusso
>



-- 
Jim Baker
jbaker at zyasoft.com


