Help me understand why PyPy is so slow in my benchmark

Hello,

I'm working on a project that involves benchmarking CPython and PyPy across a few benchmarks I've devised, but PyPy is so much slower than CPython I feel I must have made some sort of mistake. One major head-scratcher is that CPython runtimes hold stable while PyPy appears to get worse as the benchmark progresses. I've included an excerpt of the benchmark below that shows the issue; if I can attach files that won't be lost, I can attach a full copy. I don't believe the string operations are the issue, because the micro-tuning tips suggest the JIT reduces concatenation overhead in scenarios like these (if I'm reading the page right). [1]

    def recurse(num):
        if num >= 1000:
            return "M" + recurse(num - 1000)
        elif num >= 900:
            return "CM" + recurse(num - 900)
        elif num >= 500:
            return "D" + recurse(num - 500)
        elif num >= 400:
            return "CD" + recurse(num - 400)
        elif num >= 100:
            return "C" + recurse(num - 100)
        elif num >= 90:
            return "XC" + recurse(num - 90)
        elif num >= 50:
            return "L" + recurse(num - 50)
        elif num >= 40:
            return "XL" + recurse(num - 40)
        elif num >= 10:
            return "X" + recurse(num - 10)
        elif num >= 9:
            return "IX" + recurse(num - 9)
        elif num >= 5:
            return "V" + recurse(num - 5)
        elif num >= 4:
            return "IV" + recurse(num - 4)
        elif num >= 1:
            return "I" + recurse(num - 1)
        else:
            return ""

-Jeremy

[1]: https://pypy.org/performance.html#micro-tuning-tips

On 18/09/2024 05.43, Jeremy Brown wrote:
Hi Jeremy,

The page you linked (https://pypy.org/performance.html#string-concatenation-is-expensive) says:

    "PyPy's JIT makes the overhead of intermediate concatenations go away in linear code that keeps the number of concatenations small, bound and constant"

Your code is not linear and the number of concatenations is not bounded.

    "On the other hand, in code like this with a string-valued foo() function: [code example] the JIT cannot optimize out intermediate copies. This code is actually quadratic in the total size of the mylist strings due to repeated string copies of ever-larger prefix segments."

Your code is more similar to the code example that shows a case where the optimization doesn't work. Does that explain the behavior you're seeing? Feel free to make suggestions on how to clarify the wording.

-Manuel
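[Editorial aside, not part of the original exchange: a minimal sketch of the two patterns the quoted passage contrasts. The function names here are made up for illustration; the point is that repeatedly re-binding a growing string copies every prefix again, while "".join copies each character exactly once.]

```python
def concat_quadratic(parts):
    # Each `result + part` copies the entire prefix built so far,
    # so total copying is 1 + 2 + ... + n pieces: O(n^2) characters.
    result = ""
    for part in parts:
        result = result + part
    return result

def concat_linear(parts):
    # "".join allocates once and copies each character once: O(n).
    return "".join(parts)
```

Both return the same string; they differ only in how much intermediate copying happens as the input grows.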

Hi Jeremy,

Can you share how you are running this function? I tried a few variants, and PyPy is often faster than CPython on my attempts, so the rest of your code is necessary to find out what's wrong.

Cheers,
CF

On September 18, 2024 5:43:06 AM GMT+02:00, Jeremy Brown <mischif@mischivous.com> wrote:

Hi, On Wed, 18 Sept 2024 at 23:06, CF Bolz-Tereick via pypy-dev <pypy-dev@python.org> wrote:
Can you share how you are running this function? I tried a few variants, and pypy is often faster than CPython on my attempts, so the rest of your code is necessary to find out what's wrong.
I would call that code THE example of where an inlining, tracing JIT doesn't work. It's recursive, with a dozen unpredictable places where the recursion can occur. Unless the heuristics are good enough to stop *all* inlinings of this function inside itself, we just get an explosion of traced paths, right?

Armin

Manuel,

I think I just interpreted that statement incorrectly; I thought the linear-code section explained how the JIT optimized my code, not the preconditions needed for the JIT's optimization to apply.

CF,
Hi Jeremy,
Can you share how you are running this function? I tried a few variants, and pypy is often faster than CPython on my attempts, so the rest of your code is necessary to find out what's wrong.
Sure, here's a link to a copy of the benchmark: https://pastebin.com/fR0C6qcB. I also know the JIT definitely can't optimize the iterative version, but even if this is a problem specific to PyPy, I wouldn't expect runs to take twice as long compared to CPython.

-Jeremy
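[Editorial aside: the pastebin contents aren't reproduced in the thread. A hypothetical reconstruction of the iterative version being discussed, using the same value/symbol table as the recursive function but consumed in a loop, might look like this.]

```python
# Value/symbol pairs in descending order, as in the recursive version.
SYMBOLS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

def iterate(num):
    # Greedily subtract the largest value that still fits, collecting
    # symbols in a list so the final string is built with one join.
    out = []
    for value, symbol in SYMBOLS:
        while num >= value:
            out.append(symbol)
            num -= value
    return "".join(out)
```

For example, `iterate(1994)` yields "MCMXCIV", and the longest result below 4000 is `iterate(3888)`, "MMMDCCCLXXXVIII", at 15 characters.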

Hello Jeremy,

Ouch. It turns out the problem is neither your code, nor PyPy, but timeit. The timeit module is just not a safe benchmarking tool unless you use it in *exactly* the right way.

The way that timeit runs code is by using an exec'ed string, e.g. f"recurse({i})" in your case. This bytecode-compiles some code every time you call timeit.repeat, which means the JIT has to compile that bytecode to machine code for every call to repeat. That is the overhead you were measuring. So your code:

    for i in range(1, 4000):
        times["iter"][i] = repeat(
            stmt = f"iterate({i})",
            ...

simply never warms up.

To fix this, you could write your own timing helper. I did that (I'm attaching my version of the script). My results (timing all the 4000 numbers together, per implementation) are like this:

    CPython 3.11.6:
        iter 1.976924208
        rcsv 2.300890293
    PyPy3 7.3.17:
        iter 0.265697421 (7.4x faster)
        rcsv 0.643804392 (3.6x faster)

I also tried using a list and "".join, a bytearray, and a PyPy-specific string builder; all were slower. Yes, your string additions are quadratic, but the longest string being built is 15 characters, and doing 1+2+3+...+15=120 character copies is still very cheap.

My main takeaway would be that micro-benchmarking is very hard :-(. Would it be ok if I use your code in a small PSA blog post about timeit?

Cheers,
CF

On 2024-09-20 04:16, Jeremy Brown wrote:

CF,

Thanks for helping me figure this out. I definitely thought the bytecode compilation would be amortized since I set repeat to 1; I guess there's more to getting useful benchmarks out of Python than I originally thought. You can use my code for a timeit PSA; hopefully it helps someone else avoid my pitfall.

-Jeremy

On Fri, Sep 20, 2024 at 5:26 AM CF Bolz-Tereick <cfbolz@gmx.de> wrote:
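[Editorial aside: CF's attached script isn't preserved in the archive. A minimal timing helper in the same spirit, hypothetical rather than CF's actual code, calls the function directly so the bytecode is compiled once and the JIT can warm up on real calls instead of re-compiling an exec'ed string per measurement.]

```python
import time

def time_calls(func, args_list, warmup=2):
    # Warm-up passes: let a JIT (e.g. PyPy's) trace and compile the
    # real call path before anything is measured.
    for _ in range(warmup):
        for args in args_list:
            func(*args)
    # Timed pass: direct calls only; no strings are exec'ed, so
    # nothing gets re-compiled inside the measured region.
    start = time.perf_counter()
    for args in args_list:
        func(*args)
    return time.perf_counter() - start
```

For example, `time_calls(recurse, [(i,) for i in range(1, 4000)])` would time all 4000 numbers together, matching the shape of the results CF reports above.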


Hi Armin,

Sort of, yes, but even the linear sequence of ifs in the iterative version is bad: the iterative version compiles 70 traces, the recursive one 190. Both are kind of bad, but the performance is still ok (see other mail).

Cheers,
CF

On 2024-09-18 23:30, Armin Rigo wrote:
participants (4)
- Armin Rigo
- CF Bolz-Tereick
- Jeremy Brown
- Manuel Jacob