[pypy-issue] Issue #2901: Disable inlining globally hurts performance (pypy/pypy)

Tinho Lee issues-reply at bitbucket.org
Sat Oct 13 06:39:17 EDT 2018


New issue 2901: Disable inlining globally hurts performance
https://bitbucket.org/pypy/pypy/issues/2901/disable-inlining-globally-hurts

Tinho Lee:

When I profile my project with the jit log, `call_assembler_r` shows up as a heavy function because of the packing/unpacking of parameters and results. It seems that inlining would help to solve the problem; however, inlining does not always work as expected.

I have constructed two test cases, one for a recursive call and another for overriding `__getattr__`. All the tests are run with the environment variable PYPY_GC_MIN=20G in order to eliminate the influence of the GC.

Here are the basic functions defined for the recursive call test case:

```
#!python

import time


def rec(n):
	return 1 if n <= 1 else rec(n - 1) + n


def a():
	rec(1)
	rec(1)
	rec(1)
	rec(1)
	rec(1)


def b():
	rec(100)
	rec(100)
	rec(100)


def test_a():
	start = time.time()
	for n in xrange(10000000):
		a()
	print 'a', time.time() - start


def test_b():
	start = time.time()
	for n in xrange(100000):
		b()
	print 'b', time.time() - start
```

and the test code

```
#!python

print 'warm up'
test_a()
test_b()

print 'test'
test_a()
test_b()
```

The order of warm-up is important in this case. Here is the result if `test_a` is called before `test_b`:

```
warm up
a 0.0164198875427
b 0.196079015732
test
a 0.0130050182343
b 0.189415931702
```

And the result if `test_b` is called before `test_a`:

```
warm up
b 0.198947906494
a 1.29781603813
test
a 1.27277779579
b 0.186721801758
```

The performance of function `a` is greatly hurt by the order of warm-up. I have found some clues in the jit log. If `test_a` is warmed up first, `rec` is inlined into function `a` and everything works fine. However, when `test_b` is warmed up first, `rec` is detected as a recursive call and inlining is disabled, which also affects the performance of `a`.

The implementation lies in `_opimpl_recursive_call` of pyjitpl.py. 

```
#!python

if count >= memmgr.max_unroll_recursion:
    # This function is recursive and has exceeded the
    # maximum number of unrollings we allow. We want to stop
    # inlining it further and to make sure that, if it
    # hasn't happened already, the function is traced
    # separately as soon as possible.
    if have_debug_prints():
        loc = targetjitdriver_sd.warmstate.get_location_str(greenboxes)
        debug_print("recursive function (not inlined):", loc)
    warmrunnerstate.dont_trace_here(greenboxes)
```
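For reference, the limit used in this check is exposed as the `max_unroll_recursion` JIT parameter, so its effect can be experimented with from the application side. A minimal sketch, assuming the parameter is accepted by `pypyjit.set_param` on the PyPy version in use (the value 128 is arbitrary, just large enough to cover `rec(100)`):

```
#!python

# Sketch only: raise the recursion-unroll limit before any warm-up so
# that rec(100) stays below the count checked in _opimpl_recursive_call
# above. The same effect should be obtainable with the startup option
# --jit max_unroll_recursion=128.
import pypyjit
pypyjit.set_param("max_unroll_recursion=128")
```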

The same behavior is found for a non-recursive function call. `--jit trace_limit=300` is passed as a `pypy` startup parameter to make the test easy to reproduce (an in-script alternative is sketched after the test case below). Here are the basic definitions:

```
#!python

import time


class X(object):
	def __getattr__(self, name):
		return len(name)


def a(x):
	return x.a + x.b + x.c


def b(x):
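	# padding: these extra operations make the trace of b longer than the trace of a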
	a = 1
	a += 1
	a += 2
	a += 3
	a += 4
	a += 5
	a += 6
	a += 7
	a += 8
	a += 9
	return x.a + x.b + x.c


def test_a(x):
	start = time.time()
	for n in xrange(1000000):
		a(x)
	print 'a', time.time() - start


def test_b(x):
	start = time.time()
	for n in xrange(1000000):
		b(x)
	print 'b', time.time() - start
```

and the test case

```
#!python

x = X()

print 'warm up'
test_a(x)
test_b(x)

print 'test'
test_a(x)
test_b(x)
```
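As mentioned above, these runs use `--jit trace_limit=300`. The same limit can also be set from inside the script, before anything is traced; a minimal sketch, assuming `pypyjit.set_param` accepts the parameter string:

```
#!python

# Sketch: intended to be equivalent to starting pypy with
# --jit trace_limit=300; must run before the hot loops are traced.
import pypyjit
pypyjit.set_param("trace_limit=300")
```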

The order of warm-up also has a significant influence on the result.

When `test_a` is called before `test_b`, `__getattr__` is inlined for better performance.

```
warm up
a 0.0307860374451
b 0.114766836166
test
a 0.0264270305634
b 0.110279083252
```

When `test_b` is called before `test_a`, inlining of `__getattr__` is disabled, which results in worse performance for `a`.

```
warm up
b 0.115270137787
a 0.111267089844
test
a 0.10728096962
b 0.108850002289
```

I came up with an explanation for this behavior from `blackhole_if_trace_too_long` in pyjitpl.py. If `test_b` is called first, a too-long trace is detected because of the small trace limit. Then the biggest function in `b`, which is `__getattr__`, is marked as non-inlinable:

```
#!python

def blackhole_if_trace_too_long(self):
    warmrunnerstate = self.jitdriver_sd.warmstate
    if (self.history.length() > warmrunnerstate.trace_limit or
            self.history.trace_tag_overflow()):
        jd_sd, greenkey_of_huge_function = self.find_biggest_function()
        self.staticdata.stats.record_aborted(greenkey_of_huge_function)
        self.portal_trace_positions = None
        if greenkey_of_huge_function is not None:
            jd_sd.warmstate.disable_noninlinable_function(
                greenkey_of_huge_function)
            self.aborted_tracing_jitdriver = jd_sd
            self.aborted_tracing_greenkey = greenkey_of_huge_function
            if self.current_merge_points:
                jd_sd = self.jitdriver_sd
                greenkey = self.current_merge_points[0][0][:jd_sd.num_green_args]
                warmrunnerstate.JitCell.trace_next_iteration(greenkey)
        raise SwitchToBlackhole(Counters.ABORT_TOO_LONG)
```

It seems that disabling inlining has a global influence on all tracing paths. For the cases above, inlining should be enabled in `a` and disabled in `b`, but in fact inlining is disabled in both `a` and `b`.
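To double-check that the slowdown of `a` really comes from inlining being switched off, rather than from some other warm-up effect, the timings can be compared against a run where inlining is disabled explicitly for the whole process. This is only a sketch; it relies on PyPy's built-in `pypyjit` module and the documented `inlining` JIT parameter, and the exact numbers are machine-dependent:

```
#!python

# Sketch: in a fresh process, turn inlining off globally before any
# warm-up, then run the fast test case. If a() now shows timings close
# to the "test_b warmed up first" runs above, the regression is
# explained by inlining being disabled rather than by anything else.
import pypyjit
pypyjit.set_param("inlining=0")

x = X()
test_a(x)
```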

Is there any proposal that could help to improve the performance?



