[Summary: the recursion_and_inlining branch stops us tracing arbitrarily deep in recursive functions. Comments welcome. If we think this is a good idea, we have to decide if/how many levels deep we unroll before stopping tracing. My current thinking is that 7 levels looks about right, though this is heavily dependent on which benchmarks one considers.]

I recently pushed a branch recursion_and_inlining which aims to tackle the problem of controlling tracing in the face of recursion. I have been vaguely aware of an issue here for a while, but never quite managed to nail the problem down. Then Kevin Modzelewski pointed me to a small benchmark called "polymorphism" [1] he'd sent to the list back in April which shows the core problem. Some simple benchmarking shows that we have a performance problem: PyPy is 4.4x slower than CPython on this benchmark:

  $ multitime -n 10 python polymorphism.py
  ===> multitime results
  1: python /tmp/polymorphism.py
              Mean        Std.Dev.    Min         Median      Max
  real        1.649       0.058       1.595       1.632       1.763
  user        1.643       0.057       1.592       1.624       1.756
  sys         0.002       0.003       0.000       0.002       0.008

  $ multitime -n 10 ./pypy-c-orig polymorphism.py
  ===> multitime results
  1: ./pypy-c-orig /tmp/polymorphism.py
              Mean        Std.Dev.    Min         Median      Max
  real        7.198       0.047       7.131       7.203       7.274
  user        7.156       0.051       7.076       7.154       7.232
  sys         0.033       0.011       0.012       0.032       0.048

The problem is that RPython's naturally aggressive inlining also inlines recursive functions. So while we don't unroll explicit loops (while/for), we do end up unrolling recursive functions. Sometimes this works out OK, but it's more likely to end in aborts, or in traces with lots of side traces; both of the latter outcomes are inefficient. As far as I can see, aborts are most frequently triggered as a heuristic that recursion has been encountered.

There is one case in which recursion isn't inlined: if the recursive function happens to be the start of the trace, in which case (I think) pyjitpl.MIFrame.opimpl_jit_merge_point turns it into a nice loop.
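To make the problem concrete, here is a hypothetical pair of examples (not the actual polymorphism benchmark; all names are illustrative only) showing the two styles of recursion a tracer can end up unrolling:

```python
def make_tree(depth):
    # Directly recursive: the function calls itself. A tracer that
    # inlines this call without recognising the recursion effectively
    # unrolls the recursion into the trace, one level per inlining.
    if depth == 0:
        return 1
    return make_tree(depth - 1) + make_tree(depth - 1)

class Node:
    def __init__(self, children):
        self.children = children

    def score(self):
        # Indirectly recursive: score() reaches score() again only via
        # calls on other objects, so the recursion only becomes visible
        # once those calls have themselves been inlined.
        return 1 + sum(child.score() for child in self.children)
```

In both cases the recursion is bounded by data, not by an explicit loop, so the tracer has no while/for construct to anchor a loop on.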
However, if the recursion starts part way into a trace, all bets are off. Kevin's benchmark has two types of recursion: one in the make_random function, and another in Poly1.score. make_random is (obviously) directly recursive; Poly1.score is indirectly recursive. Both cause the problem noted above.

The recursion_and_inlining branch [2] tries to dynamically spot recursion by trapping function calls in pyjitpl.MIFrame._opimpl_recursive_call and seeing if the called function is currently on the meta-interpreter stack. If it is, the branch can choose to stop tracing and turn the recursive call into an actual function call (it then sets the function as JC_DONT_TRACE_HERE so that, if it hasn't been traced separately already, it will then be traced, with the recursion handled by the existing case in opimpl_jit_merge_point). Doing this speeds Kevin's benchmark up significantly:

  $ multitime -n 10 ./pypy-c-level1 /tmp/polymorphism.py
  ===> multitime results
  1: ./pypy-c-level1 /tmp/polymorphism.py
              Mean        Std.Dev.    Min         Median      Max
  real        0.535       0.013       0.516       0.535       0.560
  user        0.517       0.018       0.476       0.522       0.544
  sys         0.016       0.008       0.008       0.012       0.036

We've gone from 4.4x slower than CPython to over 3x faster (i.e. PyPy in recursion_and_inlining is 13.5x faster than normal PyPy on this benchmark). Which is nice.

However, if I run the branch on the PyPy benchmark suite, we see significant slowdowns in a handful of benchmarks relative to normal PyPy [full data is attached as results1.txt]: hexiom2 is 1.3x slower; raytrace-simple 3.3x slower; spectral-norm 3.3x slower; sympy_str 1.5x slower; and telco 4x slower. A few benchmarks speed up in a meaningful way; most run so quickly that any differences are lost in the noise (IMHO any benchmark running for 0.1s or less is too short to draw many conclusions from). The translate test doesn't seem to be impacted by the change either way.
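The core of the check can be sketched as follows. This is a much-simplified model, not the real RPython code: MetaFrame and its jitcode attribute are hypothetical stand-ins for the actual MIFrame/JitCode types that pyjitpl uses.

```python
class MetaFrame:
    """Hypothetical stand-in for one frame on the meta-interpreter's
    stack while tracing."""
    def __init__(self, jitcode):
        self.jitcode = jitcode  # the function being traced in this frame

def is_recursive_call(metainterp_stack, target_jitcode):
    """True if target_jitcode is already being traced further up the
    stack, i.e. inlining this call would start unrolling recursion."""
    return any(frame.jitcode is target_jitcode
               for frame in metainterp_stack)
```

When the check fires, the branch stops tracing, emits a real (residual) call instead of inlining, and marks the callee JC_DONT_TRACE_HERE so it gets a trace of its own, with its self-recursion closed into a loop by the existing opimpl_jit_merge_point case.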
Nevertheless -- and even taking into account that the current benchmark suite and PyPy's optimisations have sort-of evolved hand-in-hand -- the slow benchmarks aren't good. Fortunately, our branch can trivially be extended to identify not just recursion, but how deeply we've recursed. Put another way, we can choose to allow a function to be unrolled a fixed number of times before stopping tracing.

How many unrollings should we choose? Well, I've spent time putting together some rough data (note: this is not perfect benchmarking, but it's probably good enough). Let's take our slow-coaches (to one decimal place, because there's quite a bit of noise) and Kevin's benchmark (to 2 decimal places, because it runs long enough to make that sensible) and see how they change relative to normal PyPy as we crank up the unrollings:

  #unrollings      |  1   |  2   |  3   |  5   |  7   |  10  |
  -----------------+------+------+------+------+------+------+
  hexiom2          |  1.3 |  1.4 |  1.1 |  1.0 |  1.0 |  1.0 |
  raytrace-simple  |  3.3 |  3.1 |  2.8 |  1.4 |  1.0 |  1.0 |
  spectral-norm    |  3.3 |  1.0 |  1.0 |  1.0 |  1.0 |  1.0 |
  sympy_str        |  1.5 |  1.0 |  1.0 |  1.0 |  1.0 |  1.0 |
  telco            |  4.0 |  2.5 |  2.0 |  1.0 |  1.0 |  1.0 |
  -----------------+------+------+------+------+------+------+
  polymorphism     | 0.07 | 0.07 | 0.07 | 0.07 | 0.08 | 0.09 |

Lower is better in this table. What you can't quite see from it is that Kevin's benchmark starts getting a teeny bit slower from 5 unrollings upwards (we'd need another decimal place to see it). All data is attached in results<x>.txt files if you want to see the raw numbers.

What I take from the above is that a reasonable guess at a sensible number of unrollings is 7. It makes most of the standard benchmarks behave as they always have, while not doing too much to benchmarks which are recursive in a different style. Of course, this is all *highly* dependent on the benchmarks chosen, and if one day we have more benchmarks, we should revisit this number accordingly.
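In the same simplified model as before (MetaFrame and all names here are hypothetical stand-ins, not the real RPython code), counting unrollings instead of stopping at the first repeat might look like this, with 7 as the limit the data above suggests:

```python
MAX_UNROLLINGS = 7  # the value the benchmark data above points to

class MetaFrame:
    """Hypothetical stand-in for one frame on the meta-interpreter's
    stack while tracing."""
    def __init__(self, jitcode):
        self.jitcode = jitcode

def should_stop_tracing(metainterp_stack, target_jitcode,
                        limit=MAX_UNROLLINGS):
    """Count how many activations of target_jitcode are already on the
    tracing stack; stop inlining (i.e. stop unrolling the recursion)
    once that count reaches the limit."""
    depth = sum(1 for frame in metainterp_stack
                if frame.jitcode is target_jitcode)
    return depth >= limit
```

With limit=1 this degenerates to the stop-at-first-recursion behaviour benchmarked in the leftmost column of the table; raising the limit trades trace size for fewer residual calls.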
Note also that this is a patch to *RPython*: it should affect (for better or worse) all RPython VMs, not just PyPy. e.g. it speeds the Converge compiler up by a bit over 5% (I haven't tried much more than that).

So there you have it. How would you folks like to proceed? I welcome comments, as this is not an area of RPython that I've looked at before. e.g. how many unrollings do you think are sensible? Note that a few tests are currently broken, because they depend on how many unrollings we choose to do; once we've settled that behaviour, I'll unbreak the tests. I would also very much welcome tests on other recursive and non-recursive programs/benchmarks you may have, to see how this branch impacts performance.

I'd like to say thanks to Kevin for pointing out the problem to me; Carl Friedrich for helping me understand that tracing recursion was the problem; and Tanzim Hoque for helping me pick through some of the details of pyjitpl.

Laurie

[1] https://raw.githubusercontent.com/dropbox/pyston/master/microbenchmarks/poly...

[2] The key patch is: https://bitbucket.org/pypy/pypy/commits/b60064f55316ceb4a3bd784c00a467253a19...
Laurence Tratt