Interesting discussion. Just a note:
in Jython, f_lasti is only used to manage exit/entry points, specifically for coroutines/generators, so this is not at the level of bytecode granularity. We also set f_lineno, at the level of Python code of course. HotSpot apparently optimizes this access nicely in any event. (There are other problems with supporting call frames, but this is not one of them it seems.)
Java also offers a debugging interface, which in conjunction with a C++ agent, allows for more fine-grained access to these internals, potentially with lower overhead. This is something Tobias Ivarsson has been exploring.
- Jim
Hi!
This time, I'm trying to answer shortly
Is this the geninterp you're talking about?
http://codespeak.net/pypy/dist/pypy/doc/geninterp.html
Is the geninterpreted version RPython code? I'm almost sure, except
for the """NOT_RPYTHON""" doc string in the geninterpreted source
snippet. I guess it's there because the _source_ of it is not RPython
code.
> I quickly counted the number of lines for the interpreters, excluding theAlso about the amount of Python-specific optimizations you did :-).
> builtin types/functions, and we have 28188 non-empty lines for python, 5376
> for prolog and 1707 for scheme.
> I know that the number of lines does not mean anything, but I think it's a
> good hint about the relative complexities of the languages.
The 90-10 rule should apply anyway, but overhead for obscure features
> I also know
> that being more complex does not necessarily mean that it's impossible to
> write an "efficient" interpreter for it, it's an open question.
might be a problem.
Well, reflection on the call stack can have a big runtime impact, but
that's also present in Smalltalk as far as I know and that can be
handled as well.
Anyway, if Python developers are not able to implement efficient
multithreading in the interpreter because of the excessive performance
impact and they don't decide to drop refcounting, saying "there's
space for optimizations" looks like a safe bet; the source of the idea
is what I've been taught in the course, but I'm also noticing this by
myself.
Yeah, for me as well, plus I'm in the last month of my Erasmus study time :-)
> Thanks for the interesting email, but unfortunately I don't have time to
> answer right now (xmas is coming :-)), I just drop few quick notes:
I would do it together with OProfile benchmarking of indirect branches
>> Ok, just done it, the speedup given by indirect threading seems to be
>> about 18% (see also above). More proper benchmarks are needed though.
> that's interesting, thanks for having tried. I wonder I should try again
> with indirect threading in pypy soon or later.
and of their mispredictions (see the presentation for the OProfile
commands on the Core 2 processor).
They'll be sooner or later. There are a few bugs I should fix, and a
> Btw, are the sources for your project available somewhere?
few other interesting things to do.
But if you are interested in trying to do benchmarking even if it's a
student project, it's not feature complete, and it's likely buggy, I
might publish it earlier.
Yeah, sorry, I remember you wrote geninterp also does other stuff.
>> And as you say in the other mail, the overhead given by dispatch is
>> quite more than 50% (maybe).
> no, it's less.
I wonder why that's not done by your stock interpreter - the CPython
> 50% is the total speedup given by geninterp, which removes
> dispatch overhead but also other things, like storing variables on the stack
frame object has a pointer to a real stack frame; I'm not sure, but I
guess this can increase stack locality since a 32/64-byte cacheline is
much bigger than a typical stack frame and has space for the operand
stack (and needless to say we store locals on the stack, like JVMs
do).
The right benchmark for this, I guess, would be oprofiling cache
misses on a recursion test like factorial or Fibonacci.
Looking at the geninterpreted code, it's amazing that the RPython
> and turning python level flow control into C-level flow control (so e.g.
> loops are expressed as C loops).
translator can do this. Can it also already specialize the interpreter
for each of the object spaces and save the virtual calls?
== About F_LASTI ==
> by "tracking the last bytecode executed" I was really referring to theNot really, I didn't even start studying its proper semantics, but now
> equivalent of f_lasti; are you sure you can store it in a local and still
> implement sys.settrace()?
I know it's worth a try and some additional complexity, at least in an
interpreter with GC. If one write to memory has such a horrible
impact, I'm frightened by the possible impact of refcounting; on the
other side, I wouldn't be surprised if saving the f_lasti write had no
impact on CPython.
My current approach would be that if I can identify code paths where
no code can even look at it (and I guess that most simple opcodes are
such paths), I can copy f_lasti to a global structure only in the
other paths; if f_lasti is just passed to the code tracing routine and
it's called only from the interpreter loop, I could even turn it into
a parameter to that routine (it may be faster with a register calling
convention, but anyway IMHO one gets code which is easier to follow).
Actually, I even wonder if I can just set it when tracing is active,
but since that'd be trivial to do, I guess that when you return from a
call to settrace, you discover (without being able to anticipate it)
that now you need to discover the previous opcode, that's why it's not
already fixed. Still, a local can do even for that, or more
complicated algorithms can do as well (basically, the predecessor is
always known at compile time except for jumps, so only jump opcodes
really need to compute f_lasti).
Regards
--
Paolo Giarrusso
_______________________________________________
pypy-dev@codespeak.net
http://codespeak.net/mailman/listinfo/pypy-dev