Fwd: Threaded interpretation (was: Re: compiler optimizations: collecting ideas)

forgot to reply-all

---------- Forwarded message ----------
From: Jim Baker <jbaker@zyasoft.com>
Date: Sat, Dec 27, 2008 at 2:08 PM
Subject: Re: [pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)
To: Paolo Giarrusso <p.giarrusso@gmail.com>

I'm only speaking of Jython 2.5, since that's what we're working on, but I believe it was the same for 2.2. (I personally regard 2.5 as more robust, and we certainly test it more extensively, although external interfaces may still change some before our forthcoming release.)

We have limited support for tracing in terms of its events interface. It seems usable enough, although we don't produce quite the same traces. Fidelity to things we can't readily and efficiently support is not a goal for Jython, especially when they don't bear on running interesting applications. So we don't support ref counting, a GIL, and certain other internal details. We have found that when code does rely on these details, we have been able to push changes into the appropriate projects. Often the same changes are needed for alternative implementations like PyPy; we saw this with Django support. However, we do support frame introspection, the standard Python object model (including classic classes), and even the rather mixed-up unicode/str model.

Having said that, we do plan to support a Python bytecode (PBC) VM running in Jython in a future release (possibly 2.5.1). At that point, we may support last_i at the level of the PBC instruction, just like CPython. Reasons for supporting PBC include scenarios where we can't dynamically generate and then load Java bytecode (unsigned applets, for example, or Android); greenlets (which really need last_i, although it's possible to get a subset of greenlet functionality without it); and components like Jinja2, a templating engine, that directly emit PBC. It's also rather cool to do, I think.

- Jim

On Sat, Dec 27, 2008 at 12:58 PM, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:
On 26/12/2008, Jim Baker <jbaker@zyasoft.com> wrote:
Interesting discussion. Just a note:
in Jython, f_lasti is only used to manage exit/entry points, specifically for coroutines/generators, so this is not at the level of bytecode granularity.
But is it used to support sys.settrace()? Also, its past usage (before Python 2.3) for generators is mentioned in CPython source code comments, and the last stable Jython release is 2.2.1. What happens in 2.5-beta could be more interesting.
Here's the comment from <src_tree>/Python/ceval.c:
f->f_lasti now refers to the index of the last instruction executed. You might think this was obvious from the name, but this wasn't always true before 2.3! PyFrame_New now sets f->f_lasti to -1 (i.e. the index *before* the first instruction) and YIELD_VALUE doesn't fiddle with f_lasti any more. So this does work. Promise.
We also set f_lineno, at the level of Python code of course.
Hmm, CPython is already able to do this only when needed (i.e. when calling trace functions), "as of Python 2.3":
    /* As of 2.3 f_lineno is only valid when tracing is
       active (i.e. when f_trace is set) -- at other times
       use PyCode_Addr2Line instead. */
    int f_lineno;               /* Current line number */
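(For reference, a minimal sketch against the CPython 2.x C API of how one would recover the current line without tracing; the helper name current_line is made up:)

    #include <Python.h>
    #include <frameobject.h>

    /* Hypothetical helper: when f_trace is unset, f_lineno is
       stale, so map the last executed bytecode index back to a
       source line with PyCode_Addr2Line. */
    static int
    current_line(PyFrameObject *f)
    {
        if (f->f_trace != NULL)
            return f->f_lineno;   /* valid while tracing */
        return PyCode_Addr2Line(f->f_code, f->f_lasti);
    }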
HotSpot apparently optimizes this access nicely in any event. (There are other problems with supporting call frames, but this is not one of them it seems.)
Do you mean you inspected the code generated by the HotSpot native compiler? Hmm, it can't save a write to memory if these are fields of the frame object, unless escape analysis can prove that a pointer to the object does not escape the local function (escape analysis was added in Java 1.6, but is said _somewhere_ to notice too few cases).
Java also offers a debugging interface, which in conjunction with a C++ agent, allows for more fine-grained access to these internals, potentially with lower overhead. This is something Tobias Ivarsson has been exploring.
That sounds interesting, even if strange (and applicable to neither CPython nor PyPy) - do you want to offer an alternate debug interface, or to implement settrace through this?
- Jim
On Thu, Dec 25, 2008 at 9:52 PM, Paolo Giarrusso <p.giarrusso@gmail.com> wrote:
Hi! This time I'll try to answer briefly.
Is this the geninterp you're talking about?
Is the geninterpreted version RPython code? I'm almost sure, except for the """NOT_RPYTHON""" doc string in the geninterpreted source snippet. I guess it's there because the _source_ of it is not RPython code.
On Wed, Dec 24, 2008 at 22:29, Antonio Cuni <anto.cuni@gmail.com> wrote:
Paolo Giarrusso wrote:
I quickly counted the number of lines for the interpreters, excluding builtin types/functions, and we have 28188 non-empty lines for the python, 5376 for prolog and 1707 for scheme.
I know that the number of lines does not mean anything, but I think it's a good hint about the relative complexities of the languages.
Also about the amount of Python-specific optimizations you did :-).
I also know that being more complex does not necessarily mean that it's impossible to write an "efficient" interpreter for it; it's an open question.
The 90-10 rule should apply anyway, but overhead for obscure features might be a problem. Well, reflection on the call stack can have a big runtime impact, but as far as I know that's also present in Smalltalk, and it can be handled. Anyway, if Python developers are not able to implement efficient multithreading in the interpreter because of the excessive performance impact, and they don't decide to drop refcounting, then saying "there's space for optimizations" looks like a safe bet; the idea comes from what I've been taught in the course, but I'm also noticing it by myself.
Thanks for the interesting email, but unfortunately I don't have time to answer right now (xmas is coming :-)); I'll just drop a few quick notes:
Yeah, for me as well, plus I'm in the last month of my Erasmus study time :-)
Ok, just done it: the speedup given by indirect threading seems to be about 18% (see also above). More proper benchmarks are needed.
http://codespeak.net/pypy/dist/pypy/doc/geninterp.html
that's interesting, thanks for having tried. I wonder whether I should try indirect threading in pypy sooner or later.
I would do it together with OProfile benchmarking of indirect branches and of their mispredictions (see the presentation for the OProfile commands on the Core 2 processor).
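(To make indirect threading concrete, here is a minimal sketch of threaded dispatch via GCC's computed goto; the toy opcodes and bytecode are invented for illustration. Each handler ends with its own indirect branch, which is exactly what the OProfile misprediction counters would be pointed at:)

    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

    int main(void)
    {
        /* One indirect branch per handler instead of one shared
           switch: branch predictors track each site separately. */
        static void *handlers[] = { &&do_push1, &&do_add,
                                    &&do_print, &&do_halt };
        unsigned char code[] = { OP_PUSH1, OP_PUSH1, OP_ADD,
                                 OP_PRINT, OP_HALT };
        unsigned char *pc = code;
        int stack[16], *sp = stack;

    #define DISPATCH() goto *handlers[*pc++]

        DISPATCH();
    do_push1: *sp++ = 1;              DISPATCH();
    do_add:   --sp; sp[-1] += sp[0];  DISPATCH();
    do_print: printf("%d\n", sp[-1]); DISPATCH();
    do_halt:  return 0;
    }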
Btw, are the sources for your project available somewhere?
They'll be, sooner or later. There are a few bugs I should fix, and a few other interesting things to do. But if you are interested in benchmarking it even though it's a student project, not feature complete, and likely buggy, I might publish it earlier.
And as you say in the other mail, the overhead given by dispatch is quite a bit more than 50% (maybe).
no, it's less.
Yeah, sorry, I remember you wrote geninterp also does other stuff.
50% is the total speedup given by geninterp, which removes dispatch overhead but also other things, like storing variables on the stack again
I wonder why that's not done by your stock interpreter - the CPython frame object has a pointer to a real stack frame; I'm not sure, but I guess this can increase stack locality since a 32/64-byte cacheline is much bigger than a typical stack frame and has space for the operand stack (and needless to say we store locals on the stack, like JVMs do).
The right benchmark for this, I guess, would be oprofiling cache misses on a recursion test like factorial or Fibonacci.
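(A sketch of the layout I have in mind, with made-up names: the activation record keeps locals and the operand stack together on the C stack, so a recursion-heavy test touches only a couple of cache lines per call:)

    #include <stdint.h>

    /* Hypothetical activation record: locals and operand stack
       live inline, next to the interpreter's own C locals. */
    struct frame {
        const uint8_t *pc;      /* current bytecode position */
        intptr_t locals[4];     /* local variable slots */
        intptr_t stack[8];      /* inline operand stack */
        intptr_t *sp;           /* top of operand stack */
    };

    intptr_t interpret(const uint8_t *code)
    {
        struct frame f;         /* allocated on the C stack */
        f.pc = code;
        f.sp = f.stack;
        /* ... dispatch loop would run here, touching only f ... */
        return f.sp > f.stack ? f.sp[-1] : 0;
    }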
and turning python level flow control into C-level flow control (so
loops are expressed as C loops).
Looking at the geninterpreted code, it's amazing that the RPython translator can do this. Can it also already specialize the interpreter for each of the object spaces and save the virtual calls?
== About F_LASTI ==
by "tracking the last bytecode executed" I was really referring to
e.g. the
equivalent of f_lasti; are you sure you can store it in a local and still implement sys.settrace()?
Not really; I didn't even start studying its proper semantics, but now I know it's worth a try and some additional complexity, at least in an interpreter with GC. If one write to memory has such a horrible impact, I'm frightened by the possible impact of refcounting; on the other hand, I wouldn't be surprised if saving the f_lasti write had no impact on CPython.
My current approach would be: if I can identify code paths where no code can even look at it (and I guess most simple opcodes are such paths), I copy f_lasti to a global structure only on the other paths; and if f_lasti is just passed to the code-tracing routine, and that routine is called only from the interpreter loop, I could even turn it into a parameter to the routine (it may be faster with a register calling convention, but anyway IMHO one gets code which is easier to follow).
Actually, I even wonder if I can just set it when tracing is active; but since that would be trivial to do, I guess that when you return from a call to settrace, you discover (without having been able to anticipate it) that you now need to know the previous opcode, and that's why it's not already done this way. Still, a local can do even for that, or more complicated schemes can work as well (basically, the predecessor is always known at compile time except for jumps, so only jump opcodes really need to compute f_lasti).
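(A sketch of that scheme, all names hypothetical: the program counter stays in a C local, and is written back to the frame only where it becomes observable - at jumps, and before anything that can run the trace hook:)

    enum { OP_NOP, OP_JUMP, OP_TRACE, OP_HALT };

    struct frame { int f_lasti; };

    /* pc lives in a local (ideally a register); f->f_lasti is
       synced only where it is observable, not on every opcode. */
    void run(const unsigned char *code, struct frame *f,
             void (*trace_hook)(struct frame *))
    {
        int pc = 0;
        for (;;) {
            int lasti = pc;          /* index of current opcode */
            switch (code[pc++]) {
            case OP_NOP:             /* straight-line opcode: the
                                        predecessor is statically
                                        known, so skip the write */
                break;
            case OP_JUMP:
                f->f_lasti = lasti;  /* at the jump target, the
                                        previous opcode is not
                                        statically known */
                pc = code[pc];       /* one-byte absolute target */
                break;
            case OP_TRACE:
                f->f_lasti = lasti;  /* the hook may inspect it */
                trace_hook(f);
                break;
            case OP_HALT:
                f->f_lasti = lasti;
                return;
            }
        }
    }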
Regards -- Paolo Giarrusso
-- Jim Baker jbaker@zyasoft.com