[pypy-dev] towards more parallelization in the tracer/optimizer

Thu Mar 15 22:35:59 CET 2012

Hello,

I just watched Benjamin Peterson's talk on the JIT, and one comment that seemed 
like just a side thought caught my attention:

"We are limited by speed. We don't want to spend three seconds optimizing it as 
fast as gcc possibly could, because no one wants the program to stop for three 
seconds."

If the raw trace - or a minimally optimized version of it - can be run directly, 
why not run it a few times while waiting for the optimizer to finish fully 
optimizing the trace, with clever register allocation and whatnot?

I guess that, in most cases, the tracer + optimizer will kick in for a loop in 
"the middle" of many iterations, meaning that after the trace is optimized, it 
will be run a few more times, maybe hundreds.

Now consider this: if the raw or "minimally optimized" trace can be generated 
very quickly and run a couple of times while the full optimizer runs 
completely in parallel, then even if that optimizer takes longer than the one 
PyPy currently uses (which has to stay cheap because of the speed limitation 
above), there will still be a win, because a few more iterations will already 
have finished in the meantime.
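
To make the idea concrete, here is a rough sketch in plain Python. It is not 
PyPy code: HotLoop, raw_body and slow_optimizer are names I made up for 
illustration, and the "optimizer" is just a sleep followed by returning an 
equivalent function. It only shows the control flow: keep running the cheap 
version, and swap in the better one whenever the background work finishes.

```python
import threading
import time


class HotLoop:
    """Hypothetical holder for the currently installed version of a loop body."""

    def __init__(self, raw_body, optimize):
        # The quick, unoptimized version is usable immediately.
        self.body = raw_body
        self._worker = threading.Thread(
            target=self._optimize_in_background, args=(raw_body, optimize)
        )
        self._worker.start()

    def _optimize_in_background(self, raw_body, optimize):
        better = optimize(raw_body)  # may be slow; the main loop keeps running
        self.body = better           # a single attribute store swaps the code

    def wait(self):
        self._worker.join()


def raw_body(acc, i):
    # Stand-in for the raw (or minimally optimized) trace.
    return acc + i * 2


def slow_optimizer(body):
    # Stand-in for an expensive optimizer: takes a while, then returns an
    # equivalent but "better" version of the loop body.
    time.sleep(0.05)

    def optimized_body(acc, i):
        return acc + (i << 1)

    return optimized_body


def run(n):
    loop = HotLoop(raw_body, slow_optimizer)
    acc = 0
    raw_iterations = 0
    for i in range(n):
        body = loop.body  # pick up whichever version is installed right now
        if body is raw_body:
            raw_iterations += 1
        acc = body(acc, i)
    loop.wait()
    return acc, raw_iterations, loop.body
```

Calling run(1000) gives the same result whichever version handled each 
iteration; raw_iterations counts how many iterations were done while the 
optimizer was still busy, and that number depends on timing. A real 
implementation would also have to deal with guards and with patching 
machine code safely, which this sketch ignores.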

The win will be even bigger if the tracer kicks in at the very last 
iteration of a loop: the program, which previously would have stopped for the 
optimization, can now return from the function or go do some completely 
unrelated work, and the loop will still be much faster the next time it runs.

So the benefits of this are two-fold: one, PyPy's JIT can afford more expensive 
optimizations, and two, scripts run in PyPy will experience less stuttering if 
there are many different loops hitting the threshold for being considered hot.

Another related idea is to see if there are optimizations that can be run in 
parallel on an incomplete trace, so that when the trace finishes, the 
optimization is already half-done.
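
As an example of a pass that could work this way, here is a small sketch of 
constant folding that consumes operations one at a time, as a tracer would 
emit them. The operation format (tuples like ("add", dst, a, b)) and the 
class name are invented for this mail and are not PyPy's IR; I also assume 
every variable is assigned exactly once. The point is only that a forward 
pass with no lookahead never needs to see the end of the trace.

```python
class StreamingConstantFolder:
    """Hypothetical forward pass that folds constants one operation at a time."""

    def __init__(self):
        self.known = {}   # variable name -> constant value seen so far
        self.output = []  # optimized operations emitted so far

    def feed(self, op):
        """Consume one operation as soon as the tracer has recorded it."""
        kind = op[0]
        if kind == "const":
            _, dst, value = op
            self.known[dst] = value
            self.output.append(op)
        elif kind == "add":
            _, dst, a, b = op
            if a in self.known and b in self.known:
                # Both inputs are constants: replace the add by its result.
                value = self.known[a] + self.known[b]
                self.known[dst] = value
                self.output.append(("const", dst, value))
            else:
                self.output.append(op)
        else:
            # Anything else (e.g. loop inputs) is passed through unchanged.
            self.output.append(op)


# The tracer would call feed() while it is still recording; here the trace is
# replayed from a list to keep the example self-contained.
trace = [
    ("input", "x"),
    ("const", "a", 2),
    ("const", "b", 3),
    ("add", "c", "a", "b"),
    ("add", "d", "x", "c"),
]

folder = StreamingConstantFolder()
for op in trace:
    folder.feed(op)
```

After the loop, folder.output holds the same trace with the first add 
replaced by ("const", "c", 5), while the second add is kept because x is not 
known. Passes that need to look at the whole loop would not fit this scheme 
and would still have to wait for the finished trace.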

Since PyPy still has the GIL (and the STM fork doesn't have a JIT yet), I 
believe it's beneficial to use a second core for miscellaneous tasks, such as 
offloading the optimizations. Combined with a more efficient parallel garbage 
collector, this could lead to smoother speed changes in PyPy. Before, you'd 
first have slow -> standstill -> blazing fast. In such a parallelized version 
you might end up with slow -> unnoticeable pause -> faster -> blazing fast.

Thank you :)
   - Timo

PS: There's also an incomplete branch that would allow PyPy to use weaker 
optimizations on loops that have not yet run as often as the current trace 
limit. The approach here would IMO still be better, because when the trace 
limit is hit, there is no longer a second pause.