[pypy-dev] towards more parallelization in the tracer/optimizer
timo at wakelift.de
Thu Mar 15 22:35:59 CET 2012
I just watched Benjamin Peterson's talk on the JIT, and one comment that seemed
like just a side thought caught my attention:
"We are limited by speed. We don't want to spend three seconds optimizing it as
fast as gcc possibly could, because no one wants the program to stop for three
seconds."
If the raw trace - or a minimally optimized version of it - can be run directly,
why not run it once or thrice while waiting for the optimizer to finish
optimizing the trace fully, with clever register allocation and whatnot?
I guess that, in most cases, the tracer + optimizer will kick in for a loop in
"the middle" of many iterations, meaning that after the trace is optimized, it
will be run a few more times, maybe hundreds.
Now consider this: If the raw or "minimally optimized" trace can be generated
very quickly and run maybe a couple of times while the full optimizer runs
completely in parallel, then even if the optimizer takes longer than the one
PyPy currently uses (because of speed limitations), there will still be a win,
because a few more iterations have already been finished in parallel.
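
To make the handoff concrete, here is a minimal sketch in plain Python. It is
not PyPy's actual JIT machinery: HotLoop, raw_trace and expensive_optimizer are
hypothetical names, the "traces" are ordinary functions, and the Event exists
only to make the demo deterministic. Under a GIL a Python thread would not give
real parallelism either; the point is only the control flow, where the loop
keeps running the raw trace until the optimized one is swapped in.

```python
import threading

def raw_trace(acc, i):
    # Stand-in for the unoptimized trace: correct, but slow in a real JIT.
    return acc + i * 2

def expensive_optimizer(trace, may_finish):
    # Stand-in for a slow optimizer. It blocks on an event only so that the
    # demo is deterministic; a real optimizer would simply take a while.
    may_finish.wait()
    def optimized_trace(acc, i):
        return acc + (i << 1)  # same result as raw_trace
    return optimized_trace

class HotLoop:
    """Hypothetical holder for the code currently installed for one loop."""
    def __init__(self, trace, optimizer, *optimizer_args):
        self.raw = trace
        self.current = trace          # start out running the raw trace
        self.raw_runs = 0
        self.optimized_runs = 0
        self.worker = threading.Thread(
            target=self._optimize, args=(optimizer, optimizer_args))
        self.worker.start()

    def _optimize(self, optimizer, args):
        # Runs on the second thread; the swap is a single attribute store.
        self.current = optimizer(self.raw, *args)

    def step(self, acc, i):
        code = self.current           # read once per iteration
        if code is self.raw:
            self.raw_runs += 1
        else:
            self.optimized_runs += 1
        return code(acc, i)

def run_demo(iterations=10, raw_before_swap=3):
    may_finish = threading.Event()
    loop = HotLoop(raw_trace, expensive_optimizer, may_finish)
    acc = 0
    for i in range(iterations):
        if i == raw_before_swap:
            # Pretend the optimizer finishes right here.
            may_finish.set()
            loop.worker.join()
        acc = loop.step(acc, i)
    # Make sure the worker never outlives the demo.
    may_finish.set()
    loop.worker.join()
    return acc, loop.raw_runs, loop.optimized_runs
```

In this sketch the first three iterations run on the raw trace and the rest on
the optimized one, with the same final result as running the raw trace
throughout.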
The win will be even bigger if the tracer kicks in at the very last iteration
of a loop: the program, which previously would have stopped for the
optimization, can now return from the function or go do some completely
unrelated work, and still be much faster the next time it reaches the loop.
So the benefits of this are twofold: one, PyPy's JIT can afford more expensive
optimizations, and two, scripts run in PyPy will experience less stuttering if
there are many different loops hitting the threshold for being considered hot.
Another related idea is to see if there are optimizations that can be run in
parallel on an incomplete trace, so that when the trace finishes, the
optimization is already half-done.
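
As an illustration of what "optimizing an incomplete trace" could look like,
here is a small sketch of a constant folder that is fed one operation at a
time, as the tracer records it. The tuple-based operation format and the
IncrementalFolder class are made up for this example and are not PyPy's trace
representation; the sketch only shows that some passes need nothing but the
operations seen so far.

```python
class IncrementalFolder:
    """Folds constants op by op, as each operation is recorded."""

    def __init__(self):
        self.known = {}  # variable name -> constant value known so far
        self.out = []    # optimized operations emitted so far

    def feed(self, op):
        kind, dst, *args = op
        if kind == "const":
            # Remember the value instead of emitting an operation.
            self.known[dst] = args[0]
        elif kind == "add" and all(a in self.known for a in args):
            # Both operands are constants: fold the addition right away.
            self.known[dst] = self.known[args[0]] + self.known[args[1]]
        else:
            # Emit the operation, substituting any constants already known.
            self.out.append((kind, dst, *[self.known.get(a, a) for a in args]))

# Hypothetical trace: y = x + (2 + 3)
demo_trace = [
    ("input", "x"),
    ("const", "a", 2),
    ("const", "b", 3),
    ("add", "c", "a", "b"),
    ("add", "y", "x", "c"),
]
```

Feeding only the first four operations already folds c to 5, so by the time
the last operation arrives the folder just has to emit a single add with the
constant substituted in.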
Since PyPy still has the GIL (and the STM fork doesn't have a JIT yet), I
believe it's beneficial to use a second core for miscellaneous tasks, such as
offloading the optimizations. Combined with a more efficient parallel garbage
collector, this could lead to smoother speed changes in PyPy. Before, you'd
have slow -> standstill -> blazing fast. In such a parallelized version
you might end up with slow -> unnoticeable pause -> faster -> blazing fast.
Thank you :)
PS: there's also an incomplete branch that would allow PyPy to use weaker
optimizations on loops that have not run as often as the current trace limit.
The approach here would IMO still be better, because when the trace limit is
hit, there is no longer a second pause.