[pypy-dev] Threaded interpretation (was: Re: compiler optimizations: collecting ideas)

Paolo Giarrusso p.giarrusso at gmail.com
Tue Dec 23 06:16:32 CET 2008


On Sun, Dec 21, 2008 at 23:49, Antonio Cuni <anto.cuni at gmail.com> wrote:
> Paolo Giarrusso wrote:

Thanks for the interesting mail. It really goes into detail about the
issues I raised, and it points out what I needed to know: the features
that need support on the interpreter fast path, even if only as
unlikely branches. I've started the required coding, but it will take
a while (also because I have little time, given Christmas). Still, I'm
answering with some preliminary results.

First, I'd answer this question:

>> That's your problem - threading helps when you spend most of the time
>> on dispatch, and efficient interpreters get to that point.

> the question is: is it possible for a full python interpreter to be
> "efficient" as you define it?

Well, my guess is "if Prolog, Scheme and so on can, why can't Python"?

It is an interesting question, and my colleague and I have spent the
past two months trying to find an answer. It wasn't really two months
of work for two people, though; much less time was actually spent. So,
for instance, we don't yet support modules or exception handling, and
the basic support for objects (through Self/V8-style maps) is not yet
complete (but almost).

Then, on a couple of points, it simply isn't possible. Unboxed
integers work better with the Python 3.x specs: if you have unboxed
(tagged) integers and you still want a full 32-bit value, which has to
be boxed, to be an int instead of a long, you run into problems.

Also, the semantics of Python destructors are too restrictive for a
proper GC, and they can create unreclaimable reference cycles (when
all objects in the cycle have destructors). Java, unlike C++, has
finalizers instead - the difference is that an object may be finalized
after the objects it points to have already been finalized. That's why
I prefer not to say "Python finalizers" as the official docs do.
However, in both cases I feel the semantics are wrong. Even finalizers
are a bad idea - it's interesting that .NET is even more misguided
than Java about them (and that's the opinion of Hans Boehm, not mine).

So those semantics fit well only with reference counting - and
reference counting, in turn, makes efficient support for
multithreading impossible (the "free threading" patches, which removed
the global interpreter lock, gave something like a 2x slowdown, I
think). Somebody on the python-dev ML realized that CPython simply
gets both things wrong (reference counting and threading support), but
most people answered along the lines of "oh, but people shouldn't use
threading anyway, it's too complicated to get right, they should use
multiple processes instead...". That was one of the most (negatively)
astonishing things I saw.

Oh no, the worst was something like "oh, we should allow using more
than one interpreter in the same process, but we would have to move
lots of statics into the interpreter context, and it's a lot of work,
so it's not worth it". Since I've participated in a community where
bigger patches are committed every week, this looks like plain
laziness; it could be because far fewer people are paid to work on
Python, or maybe because that is simply the general mood of the
community.

In fact, one of the reasons I started this project _at all_ so early
is that the Python community clearly shows _ignorance_ of years of VM
research and pays little attention to VM performance.

== Benchmarks about dispatch overhead ==

And while you don't seem to be like that, the mention of "tracking the
last line executed" seemed quite weird. Even tracking the last
bytecode executed looks weird, though maybe it isn't. I'm inspecting
CPython's Python/ceval.c, and the overhead for instruction dispatch
looks comparable. The only real problem I'm hitting right now is
committing the last bytecode executed to memory. If I keep it in a
local, there is no problem at all; if I store it into the interpreter
context, that is a store to memory, and it hurts performance a lot -
I'm still wondering which road to take.

Runtime of the while benchmark, with 10x more iterations, user time:

baseline  (now, with tracking of f_lasti inside a local): ~4.6s
add ->f_lasti tracking _into memory_: ~5.6s
disable indirect threading: ~5.6s
both slowdown together: ~7s
Python 2.5.1: ~11.2s

Other benchmarks use fatter opcodes, so the impact is still visible,
but lower relative to the overall time. Even if f_lasti must be
updated all the time, could it be committed to memory only on some
occasions? Is it needed during the call to settrace(), or only from
the next opcode after that, i.e. after the interpreter loop returns?
The second case would be enough. Storing the variable to memory only
then would be excellent for performance, maybe even in the stock
Python interpreter.
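
To make that concrete, here is a minimal sketch (in C, with made-up
names - this is neither CPython's code nor ours) of keeping the
last-bytecode index in a register-allocated local and committing it to
the frame only on the rare tracing path and when leaving the loop:

#include <stdio.h>

struct frame { int f_lasti; int tracing; };

static void trace_hook(struct frame *f)
{
    printf("tracing at bytecode index %d\n", f->f_lasti);
}

static void run(struct frame *f, const unsigned char *code, int len)
{
    int lasti;                   /* stays in a register on the fast path */
    (void)code;                  /* the decoding itself is omitted here  */
    for (lasti = 0; lasti < len; lasti++) {
        /* ... decode and execute code[lasti] ... */
        if (f->tracing) {        /* unlikely slow path */
            f->f_lasti = lasti;  /* the only store of lasti to memory */
            trace_hook(f);
        }
    }
    f->f_lasti = lasti;          /* commit once when leaving the loop */
}

If the answer to the question above is that f_lasti only needs to be
valid while the trace hook runs (and after the loop returns), this
would be enough; otherwise the store moves back onto the fast path.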

Note: all this is on a 64-bit box with 16 registers; we're also faster
on a 32-bit one, and in some cases the difference w.r.t. Python is
bigger on 32-bit (on the GC stress test, for instance, because there
are smaller pointers to update, and on the recursion test, maybe for
the same reason as far as the stack is concerned, since a stack
element is 64 bits wide on a 64-bit machine), while on the other
benchmarks the extra registers (I guess) give the 64-bit interpreter
an advantage over Python (obviously, the comparison is always against
Python on the same machine and architecture). See slide 8 of pres.pdf.

> That's interesting but says little about the benefit of threaded
> interpretation itself, as the speedup could be given by the other
> optimizations.  For example, I suspect that for the benchmark you showed
> most of the speedup is because of tagged pointers and the better GC.

> Is it possible to make you eval loop non-threaded? Measuring the difference
> with and without indirect threading could give a good hint of how much you
> gain with it.

Ok, I've just done it: the speedup given by indirect threading seems
to be about 18% (see also above). More careful benchmarks are needed,
though. And as you say in the other mail, the overhead of dispatch may
well be more than 50%. Am I correct in assuming that "geninterpret"ing
_basically_ pastes the opcode handlers together? I guess that with
your infrastructure you can even embed the opcode parameters inside
the handlers easily - it's just a trivial partial evaluation. I tried
to apply code copying of machine code to my interpreter, but I would
have had to keep parameter fetching separate (getting the output I
needed from GCC was not easy - I only got code copying to work for an
early prototype). As for keeping variables on the stack instead of in
the frame, that's even stranger to me, given this argument.
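
Just to fix terminology, this is roughly what I mean by indirect
threading, sketched with GCC's labels-as-values (the toy opcodes and
operand encoding are made up; the point is that every handler ends
with its own indirect jump, so each opcode gets its own
branch-predictor entry):

#include <stdio.h>

enum { OP_PUSH_CONST, OP_ADD, OP_PRINT, OP_HALT };

static void run(const unsigned char *pc)
{
    /* One label per opcode; the table lives inside the function. */
    static void *dispatch[] =
        { &&op_push_const, &&op_add, &&op_print, &&op_halt };
    long stack[64], *sp = stack;

#define DISPATCH() goto *dispatch[*pc++]

    DISPATCH();

op_push_const:
    *sp++ = *pc++;              /* one-byte immediate operand */
    DISPATCH();
op_add:
    sp[-2] += sp[-1]; sp--;
    DISPATCH();
op_print:
    printf("%ld\n", sp[-1]);
    DISPATCH();
op_halt:
    return;
#undef DISPATCH
}

int main(void)
{
    const unsigned char prog[] =
        { OP_PUSH_CONST, 2, OP_PUSH_CONST, 3, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);                  /* prints 5 */
    return 0;
}

A switch-based loop funnels every dispatch through a single indirect
jump, which is exactly what this scheme avoids.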

> What kind of bytecode you use? The same as CPython or a custom one? E.g. I
> found that if we want to handle properly the EXTENDED_ARG CPython opcode it
> is necessary to insert a lot of code before jumping to the next opcode.

We handle a subset of Python's opcodes, but what you say means that
we'll have to handle EXTENDED_ARG properly, given our mission. Oh -
done, I would say (tests are still needed, though). I think that "a
lot of code" can only refer to the requirement of tracking the last
bytecode executed, but I'll have to look at your sources and CPython's
for that.
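
For the record, this is how I understand the decoding (a
self-contained sketch with made-up names; the 16-bit little-endian
argument width and the opcode numbers EXTENDED_ARG = 143 and
HAVE_ARGUMENT = 90 are the CPython 2.x ones):

#include <stdint.h>

enum { HAVE_ARGUMENT = 90, EXTENDED_ARG = 143 };

/* Decode one (opcode, oparg) pair, folding in a preceding
   EXTENDED_ARG, which supplies the high 16 bits of the following
   opcode's argument. */
static const uint8_t *decode(const uint8_t *pc, int *opcode, uint32_t *oparg)
{
    *opcode = *pc++;
    *oparg = 0;
    if (*opcode >= HAVE_ARGUMENT) {
        *oparg = pc[0] | (pc[1] << 8);      /* little-endian 16 bits */
        pc += 2;
        if (*opcode == EXTENDED_ARG) {      /* rare, off the fast path */
            *opcode = *pc++;
            *oparg = (*oparg << 16) | (pc[0] | (pc[1] << 8));
            pc += 2;
        }
    }
    return pc;
}

The EXTENDED_ARG branch only runs for the few instructions that
actually need a wide argument, so it doesn't have to touch the fast
path of the other opcodes.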

That seems to have an impact of a few percentage points (something
like 4-5% slower than before), but I still have to do real
benchmarking for this. I fear I'll have to compute confidence
intervals to give any meaningful number (I never needed that when
comparing us with CPython), because the impact seems to be below the
statistical noise.

However, just having the code laid out the wrong way can make things
much worse. As soon as I wrote it, it almost doubled the runtime of
the arithmetic benchmark (while it had a much lower impact on the
others). It seems GCC had decided that HAS_ARG(opcode) had become
unlikely; adding a "likely" hint fixed it.
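
By "likely"/"unlikely" I mean the usual wrappers around GCC's
__builtin_expect (the macro names are just the conventional ones;
HAS_ARG is the CPython macro mentioned above, fetch_arg is a made-up
placeholder):

/* Hints for branch layout: GCC puts the expected path on the
   fall-through side and moves the other one out of line. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* e.g.:  if (LIKELY(HAS_ARG(opcode))) oparg = fetch_arg(&pc);  */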

In fact, one of the reasons we are so fast is that we started with a
small interpreter and tracked down every performance regression as
soon as it happened. At some point we lost half of our performance
just because the address of the program_counter local variable
escaped, so that variable had to live on the stack instead of in a
register.

== Opcode set ==
We really wanted to have our own opcode set, but we didn't manage to.
In particular, I was refreshed when I read about your LOAD_METHOD
optimization. Also, invoking __init__ is a real mess; Java handles it
much more nicely, and I don't see the point of not using JVM-style
two-step instance construction:

new Class
dup        // <-- keeps a copy of the new instance for after <init> returns
<arguments>
invokespecial Class.<init>

The existence of bound methods prevents simply pushing the receiver as
the first argument on the stack; however, I can't see a similar
argument for instance construction, even if __init__ can be a bound
method. Generating similar bytecode would avoid the need to rotate the
stack. In Python, one would get something like:

ALLOC_INSTANCE Class // <--- new opcode
DUP_TOP //Copy the value to return
LOAD_METHOD __init__
<arguments>
INVOKE_METHOD
POP_TOP //Remove the None returned by the constructor

However, the dispatch cost of the additional DUP_TOP and POP_TOP might
be a problem. I guess such bytecode would certainly be more efficient
to JIT-compile, but for actual interpretation benchmarks would be
needed.

> Moreover, tagging pointer with 1 helps a lot for numerical benchmarks, but
> it is possible that causes a slowdown for other kind of operations.  Do you
> have non-numerical benchmarks? (though I know that it's hard to get fair
> comparison, because the Python object model is complex and it's not easy to
> write a subset of it in a way that is not cheating)

I agree with that. The choice follows the one made by the V8 people,
and it was already tested in the Self system, but I'll answer you with
actual tests. We have a couple of recursion tests, with a recursive
function and a recursive method (though the latter one is unfair), and
a test with list access - we are ~50% faster even there.
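
For clarity, the tagging scheme I mean is the V8 one mentioned above:
object pointers carry a 1 in the low bit, while small integers are
stored shifted left by one with the low bit clear. A minimal sketch in
C (made-up names; the overflow check uses a GCC/Clang builtin, and the
shifts assume the usual two's-complement behaviour):

#include <stdint.h>

typedef intptr_t value_t;

static inline int      is_smallint(value_t v) { return (v & 1) == 0; }
static inline value_t  from_int(intptr_t i)   { return (value_t)((uintptr_t)i << 1); }
static inline intptr_t to_int(value_t v)      { return v >> 1; }
static inline value_t  from_ptr(void *p)      { return (value_t)(intptr_t)p | 1; }
static inline void    *to_ptr(value_t v)      { return (void *)(v & ~(intptr_t)1); }

/* Fast path of an integer add: no allocation if both operands are
   tagged ints and the result still fits.  *out is only meaningful
   when 1 is returned; on 0, fall back to the boxed slow path. */
static inline int add_smallint(value_t a, value_t b, value_t *out)
{
    return is_smallint(a) && is_smallint(b) &&
           !__builtin_add_overflow(a, b, out);
}

The point is exactly that the fast path touches no memory at all,
while anything that is not a small integer falls back to the generic,
boxed protocol.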

For the Python object model we have the opposite problem, because
there is much more code to write and optimize: our INPLACE_ADD is
identical to BINARY_ADD, so given a list l, a loop doing "l += [1]"
can kill us completely (since for now we create a new list each time).
Indeed, "l = l + [1]" is just as slow with our interpreter, but it is
much slower in Python. Even there we are still faster, but most of the
time is spent concatenating lists, so we are only ~15% faster. I guess
we'd gain more if INPLACE_ADD were implemented properly, and that's
not a big deal.

> Finally, as Carl said, it would be nice to know which kind of subset it is.

I just need to find the time to write it down, sorry for the wait. I'm
attaching our final presentation, the report we wrote for the course,
and the _benchmarks_ we have. But basically, the subset we implemented
is _much_ smaller.

> E.g. does it support exceptions, sys.settrace() and sys._getframe()?

Exception handling is not supported, but stack unwinding should be
possible without any impact on the fast path.

We didn't know about the other two.
I saw code in CPython which I guess handles settrace, but it's on the
slow path; I think I can easily simulate its impact by adding one
unlikely conditional jump. Is it required to track the last bytecode
even when tracing is not active, as asked above?

About _getframe(), I'm quite curious about how it works and how your
support for it works. From the next mail you sent, it seems that you
construct the needed frames during the function calls (and I saw some
documentation about them). Back in the Smalltalk days, developers
found that constructing the frame object only when reflective access
to the frame is needed is the only reasonable solution if performance
matters.
_If_ the frame object returned is modifiable, I still think one can
intercept field modifications and update the original stack to track
the modified object.

In fact, to get the current performance, we don't allocate a locals
dictionary unless it is needed; currently it seems that allocating it
only when STORE_NAME is executed gives the correct semantics (and if
that turned out to be wrong, computing a NEEDS_LOCAL_DICT flag during
compilation would be trivial). Otherwise, the recursion test would be
slower than Python's.
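
In code, the idea is simply this (a sketch with made-up names -
frame_t, dict_new and dict_setitem are not real APIs):

#include <stddef.h>

typedef struct dict dict_t;
typedef struct {
    dict_t *locals_dict;     /* NULL until a STORE_NAME actually runs */
    /* ... fast locals, value stack, ... */
} frame_t;

dict_t *dict_new(void);
void    dict_setitem(dict_t *d, const char *key, void *value);

/* Called only from the STORE_NAME handler, i.e. never on the
   LOAD_FAST/STORE_FAST fast path, so ordinary function calls never
   pay for the dictionary allocation. */
static void store_name(frame_t *f, const char *name, void *value)
{
    if (f->locals_dict == NULL)
        f->locals_dict = dict_new();
    dict_setitem(f->locals_dict, name, value);
}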

Also, we didn't bother with Unicode, and that's complicated to handle
- even Unicode handling in V8 has problems, and string operations are
slower than in TraceMonkey.

> is your subset large enough to handle e.g. pystone? What is the result?

We didn't have the time to even download pystone.

>>> 1) in Python a lot of opcodes are quite complex and time-consuming,
>>
>> That's wrong for a number of reasons - the most common opcodes are
>> probably loads from the constant pool, and loads and stores to/from
>> the locals (through LOAD/STORE_FAST). Right now, our hotspot is indeed
>> the dispatch after the LOAD_FAST opcode.

> if you do benchmarks as the one showed above, I agree with you.  If you
> consider real world applications, unfortunately there is more than
> LOAD_CONST and LOAD_FAST: GETATTR, SETATTR, CALL, etc. are all much more
> time consuming than LOAD_{FAST,CONST}

Yep, but if you want to implement a low-level algorithm in Python, at
some point you'll write such code in an inner loop. But yes, we've
also been working on attribute fetch - we've been implementing
V8-style maps just for that (we worked on it for a couple of days, and
I'm fixing the remaining bugs - it's not really a lot of code).

For GET/SET_ATTR, I will need to do "inline caching" of lookup results
(even if the term doesn't quite apply to an interpreter). Making it
faster than a dictionary lookup will be hard, however, since our
strings cache their hash code.
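
What I mean by caching lookup results is roughly this: each
attribute-access site remembers the map it last saw and the slot the
name resolved to, so a hit costs one compare and one load. A sketch
with made-up names (map_t, object_t, lookup_slow are not real APIs):

#include <stddef.h>

/* A Self/V8-style map describes the layout of an object; each
   GET_ATTR site carries one cache entry. */
typedef struct map map_t;

typedef struct {
    map_t  *map;         /* shared layout descriptor */
    void  **slots;       /* attribute values, indexed by slot */
} object_t;

typedef struct {
    map_t  *cached_map;  /* map seen the last time this site ran */
    size_t  cached_slot; /* where the attribute lived in that map */
} ic_entry_t;

/* Slow path: full lookup through the map, then refill the cache. */
void *lookup_slow(object_t *obj, const char *name, ic_entry_t *ic);

static inline void *get_attr(object_t *obj, const char *name, ic_entry_t *ic)
{
    if (obj->map == ic->cached_map)        /* cache hit: one compare */
        return obj->slots[ic->cached_slot];
    return lookup_slow(obj, name, ic);     /* miss: take the slow path */
}

Whether that actually beats a dictionary lookup against strings with
cached hashes is exactly what we'll have to benchmark.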

>>> 2) due to Python's semantics, it's not possible to just jump from one
>>> opcode
>>> to the next, as we need to do a lot of bookkeeping, like remembering what
>>> was the last line executed, etc.
>>
>> No, you don't need that, and not even CPython does it. For exception
>> handling, just _when an exception is thrown_,

> [cut]

> Sorry, I made a typo: it is needed to remember the last *bytecode* executed,
> not the last line.  This is necessary to implement properly sys.settrace().

> I never mentioned exception handling, that was your (wrong :-)) guess.

Indeed, and I was quite surprised to see something like that. I'm
happier now, but see above for my comments on its impact.

>> If the interpreter loop is able to overflow the Icache, that should be
>> fought through __builtin_expect first, to give hint for jump
>> prediction and lay out slowpaths out-of-line.

> I think that Armin tried once to use __builtin_expect, but I don't remember
> the outcome.  Armin, what was it?

On suitable benchmarks it's easy for me to see that a wrong
expectation can give horrible results (not always, but often), and
sometimes GCC does guess wrong. But once, on a short opcode, fixing
code that was laid out the wrong way didn't make any noticeable
difference. I guess that executing one simple opcode may not be enough
to fill a 20-stage pipeline before the next dispatch, which flushes it
anyway. So whether the handler has 10 or 20 instructions doesn't
matter that much - those ~20 cycles are needed anyway. In that case,
fixing the layout probably made the same instructions execute faster,
but then they had to wait longer. Also, I do guess that dynamic branch
prediction can compensate for a wrong code layout (not perfectly,
though - a predicted taken branch costs more than a not-taken one).

Note: to give you actual numbers, BINARY_ADD uses 16 assembly
instructions, and then 11 are needed for dispatching, plus another 7
if the next opcode has a parameter; the slow paths, e.g. for
EXTENDED_ARG, use some more instructions, but they are laid out
out-of-line. DUP_TOP is three assembly instructions, POP_TOP is one,
and the stack pointer is kept in RBX.

Note 2: by suitable benchmarks, I don't mean unfair ones. It would be
easy (and unfair) to give hints for the exact code you have in your
benchmark.

>> Well, my plan is first to try, at some point, to implant threading
>> into the Python interpreter and benchmark the difference - it
>> shouldn't take long but it has a low priority currently.

> That would be cool, tell us when you have done :-).

For sure I will let you know, just don't hold your breath on that :-).

Regards
-- 
Paolo Giarrusso
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pres.pdf
Type: application/pdf
Size: 56344 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/pypy-dev/attachments/20081223/c4fcfdf7/attachment.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: report.pdf.bz2
Type: application/x-bzip2
Size: 105640 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/pypy-dev/attachments/20081223/c4fcfdf7/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench.tar.bz2
Type: application/x-bzip2
Size: 2549 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/pypy-dev/attachments/20081223/c4fcfdf7/attachment-0001.bin>

