[Summary: the recursion_and_inlining branch stops us tracing arbitrarily deep in recursive functions. Comments welcome. If we think this is a good idea, we have to decide if/how many levels deep we unroll before stopping tracing. My current thinking is that 7 levels looks about right, though this is heavily dependent on which benchmarks one considers.]

I recently pushed a branch recursion_and_inlining which aims to tackle the problem of controlling tracing in the face of recursion. I have been vaguely aware of an issue here for a while, but never quite managed to nail the problem down. Then Kevin Modzelewski pointed me to a small benchmark called "polymorphism" [1] he'd sent to the list back in April which shows the core problem. Some simple benchmarking shows that we have a performance problem: PyPy is 4.4x slower than CPython on this benchmark:

  $ multitime -n 10 python polymorphism.py
  ===> multitime results
  1: python /tmp/polymorphism.py
              Mean        Std.Dev.    Min         Median      Max
  real        1.649       0.058       1.595       1.632       1.763
  user        1.643       0.057       1.592       1.624       1.756
  sys         0.002       0.003       0.000       0.002       0.008

  $ multitime -n 10 ./pypy-c-orig polymorphism.py
  ===> multitime results
  1: ./pypy-c-orig /tmp/polymorphism.py
              Mean        Std.Dev.    Min         Median      Max
  real        7.198       0.047       7.131       7.203       7.274
  user        7.156       0.051       7.076       7.154       7.232
  sys         0.033       0.011       0.012       0.032       0.048

The problem is that RPython's naturally aggressive inlining also inlines recursive functions. So while we don't unroll explicit loops (while/for), we do end up unrolling recursive functions. Sometimes this works out OK, but it's more likely to end in aborts, or in traces with lots of side traces; both of the latter outcomes are inefficient. As far as I can see, aborts are most frequently triggered as a heuristic that recursion has been encountered.

There is one case in which recursion isn't inlined: if the recursive function happens to be the start of the trace, in which case (I think) pyjitpl.MIFrame.opimpl_jit_merge_point turns it into a nice loop.
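To make the problem concrete, here is a hypothetical pair of examples (not the actual polymorphism benchmark; all names are illustrative only) showing the two styles of recursion a tracer can end up unrolling:

```python
def make_tree(depth):
    # Directly recursive: the function calls itself. A tracer that
    # inlines this call without recognising the recursion effectively
    # unrolls the recursion into the trace, one level per inlining.
    if depth == 0:
        return 1
    return make_tree(depth - 1) + make_tree(depth - 1)

class Node:
    def __init__(self, children):
        self.children = children

    def score(self):
        # Indirectly recursive: score() reaches score() again only via
        # calls on other objects, so the recursion only becomes visible
        # once those calls have themselves been inlined.
        return 1 + sum(child.score() for child in self.children)
```

In both cases the recursion is bounded by data, not by an explicit loop, so the tracer has no while/for construct to anchor a loop on.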
However, if the recursion starts part way into a trace, all bets are off. Kevin's benchmark has two types of recursion: one in the make_random function, and another in Poly1.score. make_random is (obviously) directly recursive; Poly1.score is indirectly recursive. Both cause the problem noted above.

The recursion_and_inlining branch [2] tries to dynamically spot recursion by trapping function calls in pyjitpl.MIFrame._opimpl_recursive_call and seeing if the called function is currently on the meta-interpreter stack. If it is, the branch can choose to stop tracing and turn the recursive call into an actual function call (it then sets the function as JC_DONT_TRACE_HERE so that, if it hasn't been traced separately already, it will then be traced, with the recursion handled by the existing case in opimpl_jit_merge_point). Doing this speeds Kevin's benchmark up significantly:

  $ multitime -n 10 ./pypy-c-level1 /tmp/polymorphism.py
  ===> multitime results
  1: ./pypy-c-level1 /tmp/polymorphism.py
              Mean        Std.Dev.    Min         Median      Max
  real        0.535       0.013       0.516       0.535       0.560
  user        0.517       0.018       0.476       0.522       0.544
  sys         0.016       0.008       0.008       0.012       0.036

We've gone from 4.4x slower than CPython to over 3x faster (i.e. PyPy in recursion_and_inlining is 13.5x faster than normal PyPy on this benchmark). Which is nice.

However, if I run the branch on the PyPy benchmark suite, we see significant slowdowns in a handful of benchmarks relative to normal PyPy [full data is attached as results1.txt]: hexiom2 is 1.3x slower; raytrace-simple 3.3x slower; spectral-norm 3.3x slower; sympy_str 1.5x slower; and telco 4x slower. A few benchmarks speed up in a meaningful way; most run so quickly that any differences are lost in the noise (IMHO any benchmark running for 0.1s or less is too short to draw many conclusions from). The translate test doesn't seem to be impacted by the change either way.
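The core of the check can be sketched as follows. This is a much-simplified model, not the real RPython code: MetaFrame and its jitcode attribute are hypothetical stand-ins for the actual MIFrame/JitCode types that pyjitpl uses.

```python
class MetaFrame:
    """Hypothetical stand-in for one frame on the meta-interpreter's
    stack while tracing."""
    def __init__(self, jitcode):
        self.jitcode = jitcode  # the function being traced in this frame

def is_recursive_call(metainterp_stack, target_jitcode):
    """True if target_jitcode is already being traced further up the
    stack, i.e. inlining this call would start unrolling recursion."""
    return any(frame.jitcode is target_jitcode
               for frame in metainterp_stack)
```

When the check fires, the branch stops tracing, emits a real (residual) call instead of inlining, and marks the callee JC_DONT_TRACE_HERE so it gets a trace of its own, with its self-recursion closed into a loop by the existing opimpl_jit_merge_point case.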
Nevertheless -- and even taking into account that the current benchmark suite and PyPy's optimisations have sort-of evolved hand-in-hand -- the slow benchmarks aren't good. Fortunately, our branch can trivially be extended to identify not just recursion, but how deeply we've recursed. Put another way, we can choose to allow a function to be unrolled a fixed number of times before stopping tracing.

How many unrollings should we choose? Well, I've spent time putting together some rough data (note: this is not perfect benchmarking, but it's probably good enough). Let's take our slow-coaches (to one decimal place, because there's quite a bit of noise) and Kevin's benchmark (to 2 decimal places, because it runs long enough to make that sensible) and see how they change relative to normal PyPy as we crank up the unrollings:

  #unrollings      |  1   |  2   |  3   |  5   |  7   |  10  |
  -----------------+------+------+------+------+------+------+
  hexiom2          |  1.3 |  1.4 |  1.1 |  1.0 |  1.0 |  1.0 |
  raytrace-simple  |  3.3 |  3.1 |  2.8 |  1.4 |  1.0 |  1.0 |
  spectral-norm    |  3.3 |  1.0 |  1.0 |  1.0 |  1.0 |  1.0 |
  sympy_str        |  1.5 |  1.0 |  1.0 |  1.0 |  1.0 |  1.0 |
  telco            |  4.0 |  2.5 |  2.0 |  1.0 |  1.0 |  1.0 |
  -----------------+------+------+------+------+------+------+
  polymorphism     | 0.07 | 0.07 | 0.07 | 0.07 | 0.08 | 0.09 |

Lower is better in this table. What you can't quite see from it is that Kevin's benchmark starts getting a teeny bit slower from 5 unrollings upwards (we'd need another decimal place to see it). All data is attached in results<x>.txt files if you want to see the raw numbers.

What I take from the above is that a reasonable guess at a sensible number of unrollings is 7. It makes most of the standard benchmarks behave as they always have, while not doing too much to benchmarks which are recursive in a different style. Of course, this is all *highly* dependent on the benchmarks chosen, and if one day we have more benchmarks, we should revisit this number accordingly.
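In the same simplified model as before (MetaFrame and all names here are hypothetical stand-ins, not the real RPython code), counting unrollings instead of stopping at the first repeat might look like this, with 7 as the limit the data above suggests:

```python
MAX_UNROLLINGS = 7  # the value the benchmark data above points to

class MetaFrame:
    """Hypothetical stand-in for one frame on the meta-interpreter's
    stack while tracing."""
    def __init__(self, jitcode):
        self.jitcode = jitcode

def should_stop_tracing(metainterp_stack, target_jitcode,
                        limit=MAX_UNROLLINGS):
    """Count how many activations of target_jitcode are already on the
    tracing stack; stop inlining (i.e. stop unrolling the recursion)
    once that count reaches the limit."""
    depth = sum(1 for frame in metainterp_stack
                if frame.jitcode is target_jitcode)
    return depth >= limit
```

With limit=1 this degenerates to the stop-at-first-recursion behaviour benchmarked in the leftmost column of the table; raising the limit trades trace size for fewer residual calls.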
Note also that this is a patch to *RPython*: it should affect (for better or worse) all RPython VMs, not just PyPy. e.g. it speeds the Converge compiler up by a bit over 5% (I haven't tried much more than that).

So there you have it. How would you folks like to proceed? I welcome comments, as this is not an area of RPython that I've looked at before. e.g. how many unrollings do you think are sensible? Note that a few tests are currently broken, because they depend on how many unrollings we choose to do; once we've settled that behaviour, I'll unbreak the tests. I would also very much welcome tests on other recursive and non-recursive programs/benchmarks you may have, to see how this branch impacts performance.

I'd like to say thanks to Kevin for pointing out the problem to me; Carl Friedrich for helping me understand that tracing recursion was the problem; and Tanzim Hoque for helping me pick through some of the details of pyjitpl.

Laurie

[1] https://raw.githubusercontent.com/dropbox/pyston/master/microbenchmarks/poly...

[2] The key patch is: https://bitbucket.org/pypy/pypy/commits/b60064f55316ceb4a3bd784c00a467253a19...
Laurence Tratt