[pypy-dev] Another idea for improving warm-up times

Armin Rigo arigo at tunes.org
Sat Feb 7 18:16:15 CET 2015


Hi all,

Here's an idea that came up today on irc (thanks the_drow_) while
discussing again saving the generated assembler into files and
reloading it on the next run.  As you know we categorize this idea as
"can never work", but now I think that there is a variant that could
work, unless I'm missing something.

What *cannot* work is storing in the files any information about
Python-level data, like "this piece of assembler assumes that there is
a module X with a class Y with a method Z".  I'm not going to repeat
again the multiple problems with that (see
http://pypy.readthedocs.org/en/latest/faq.html#couldn-t-the-jit-dump-and-reload-already-compiled-machine-code).
However, it might be possible to save enough lower-level information
to avoid the problem completely.

The idea would be that when we're about to enter tracing mode and
there is a saved file, we use a "fast-path":

* we find a *likely* candidate from the saved file, based on some
explicitly-provided way to hash the Python code objects, for example
(it doesn't have to be a guaranteed match)

* this likely saved trace comes with a recording of all the *green*
guards that we originally did (both promotions and things that just
depend on purely green inputs).  (This means extra work for making
sure we save this information in the first place.)

* we run it like in the blackhole interpreter, but checking the result
of the green guards against the recorded ones

* we also take all green values that we get this way and pass them as
constants to the next step (see below, "**").
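To make the steps above concrete, here is a minimal sketch in Python. All the names (GreenGuard, SavedTrace, try_fast_path) are invented for illustration; none of this is real PyPy code, just the shape of the check:

```python
class GreenGuard:
    """A recorded decision that depended only on green inputs,
    e.g. a promotion or a switch on a purely green value."""
    def __init__(self, evaluate, expected):
        self.evaluate = evaluate    # function of the new interpreter state
        self.expected = expected    # result observed in the old process

class SavedTrace:
    def __init__(self, green_guards, assembler_bytes):
        self.green_guards = green_guards
        self.assembler_bytes = assembler_bytes

def try_fast_path(code_hash, saved_traces, interp_state):
    # Find a *likely* candidate; the hash is a cheap filter,
    # not a guaranteed match.
    candidate = saved_traces.get(code_hash)
    if candidate is None:
        return None
    # Replay the recorded green guards against the new interpreter's
    # state; any mismatch means we fall back to normal tracing.
    new_greens = []
    for guard in candidate.green_guards:
        value = guard.evaluate(interp_state)
        if value != guard.expected:
            return None
        new_greens.append(value)
    # The new green values become the constants to patch into the
    # reloaded assembler ("**" in the text).
    return candidate, new_greens
```

The key property is that a mismatch is never an error: it just means this saved trace does not apply, and we trace normally as today.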

This means that we generalize (and lower-level-ize) the vague idea "a
module X with a class Y with a method Z" to be instead a series of
guards that need to give the same results as before.  They would
automatically check some subset of the new interpreter's state by
comparing it against the old's --- but only as much as the actual loop
happens to need.  For example, if we had in the (old) normal trace a
guard_value(p12, <constant pointer>), then of course it makes no sense
to record the old interpreter's constant pointer, which will change.
But it makes sense to record *what* was really deduced from this
constant pointer, i.e. all the green getfields and getarrayitems we
did.  And for example, if it was a PyCode object, we would record the
green switch that we did on the integer value that we got in the old
interpreter (which is the next opcode), even though that's all green.
That's the real condition: that we would follow the same path by
constant-folding the decisions on the green variables.
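In other words, what gets recorded is the chain of green reads, replayable on the new process's objects. A toy sketch (FakeCode and replay_green_reads are invented stand-ins, not PyPy internals):

```python
class FakeCode:
    """Stand-in for a PyCode-like object; purely illustrative."""
    def __init__(self, co_code):
        self.co_code = co_code

def replay_green_reads(new_obj, recorded):
    # Walk the recorded chain of green getfields/getarrayitems on the
    # *new* object, instead of comparing raw pointers, which change
    # between runs.
    value = new_obj
    for kind, arg in recorded:
        if kind == "getfield":
            value = getattr(value, arg)
        elif kind == "getarrayitem":
            value = value[arg]
    return value  # must match the value the old process switched on
```

So the saved file would say "getfield co_code, getarrayitem 0, expect opcode 100", never "expect pointer 0x7f3a...".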

So, to finish the new interpreter's reloading: if the checking done
above passes, the next step is a fast-path through the assembler.  We
"just" need to reload the saved assembler as a sequence of bytes, and
fix all constants there.  To continue the example above, if a piece of
assembler was generated from the instruction guard_value(p12,
<constant pointer>), then the saved file must contain enough
information so that we know we must replace this old constant
pointer's value in the assembler with the new constant pointer's value
recorded above (at "**").
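The fixing step itself would amount to overwriting known offsets in the saved machine code with the new process's values. A hypothetical sketch, assuming the saved file records (offset, name) relocations for 8-byte constants (the offsets and encoding here are invented):

```python
import struct

def fix_constants(asm_bytes, relocations, new_values):
    # relocations: list of (byte_offset, constant_name) recorded when
    # the assembler was originally generated.
    # new_values: the green values collected during the checking phase.
    buf = bytearray(asm_bytes)
    for offset, name in relocations:
        # Overwrite the old 8-byte constant with the new process's value.
        struct.pack_into("<Q", buf, offset, new_values[name])
    return bytes(buf)
```

For instance, for the guard_value(p12, ...) above, the relocation would name p12 and point at the immediate operand embedded in the guard's machine code.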

Overall, this would result in a much faster warm-up: faster tracing,
and no optimization nor regular assembler generation at all --- only a
very quick assembler reloading and fixing step.

There are of course complications from the fact that we don't simply
record loops, but bridges.  They might be seen in a different order in
the new process, so that when we are in the checking mode, we might
run the start of the loop, but then jump into the bridge --- even
though the loop was not fully seen so far.  This is not impossible to
implement by reloading the complete loop+bridge, but making the tail
of the loop invalid until we really run into it (with an extra
temporary guard).  And I'm sure that unrolling will somehow come with
its lot of funniness, as usual.
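One way to picture the loop+bridge reloading: each reloaded piece starts "invalid", guarded by a temporary check, and is only validated when the new process actually reaches it and the check passes. A very rough sketch with invented names:

```python
class ReloadedTrace:
    """Invented illustration: parts (loop tail, bridges) may be
    validated in any order, matching the order the new process
    happens to reach them in."""
    def __init__(self, parts):
        self.parts = parts            # e.g. {"loop_tail": ..., "bridge_1": ...}
        self.validated = set()

    def enter(self, part, green_check):
        if part not in self.validated:
            # Temporary guard: re-run the green checks for this part.
            if not green_check():
                return False          # fall back to normal tracing
            self.validated.add(part)
        return True
```

So a bridge can be validated before the loop tail it hangs off, which is exactly the out-of-order situation described above.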

However, does it sound reasonable at all, or am I missing something else?


A bientôt,

Armin.
