Program slower on Pypy 7.3.3 (3.7.9) than CPython 3.9.

And it's open source, though many of the inputs are licensed. The code is at https://stromberg.dnsalias.org/~strombrg/music-pipeline/ (https://stromberg.dnsalias.org/svn/music-pipeline/trunk/). It appears to be more than 10x slower. I haven't profiled it yet, but I believe it's the "Blocklisting files..." part that's slow. That part is O(n*m) and seems to take forever. It's heavy on regular expressions. Are regular expressions expected to be slow on Pypy3? Thanks. -- Dan Stromberg

I put a small SSCCE for this at https://stromberg.dnsalias.org/svn/regex-fodder/trunk. It's almost 10x slower, not quite as much as music-pipeline.
On Mon, Mar 15, 2021 at 3:17 PM Dan Stromberg <strombrg@gmail.com> wrote:

On 3/15/21 11:16 PM, Dan Stromberg wrote:
Hi Dan,

Interesting problem! Single regular expressions are reasonably fast on PyPy, being jitted. But I don't think we looked into the problem of "what if you have thousands of them" before. Your reproducer is hitting a known, hard-to-fix corner case of the JIT: it's actually producing a linear search over the existing regular expressions for every match call in this case, with catastrophic consequences. It's on my mid-term plans to work on this problem, but not next week.

Here's a fun workaround that improves the performance of both CPython (by about 2x for me) and pypy (by 10x or so). Turn the many regular expressions into a single one:

    regex_strings = [f"(?:{one_regex()})" for repno in range(2_046)]
    regex_compiled = re.compile("|".join(regex_strings))

Then you replace the match calls with a single one:

    for filename in filenames:
        if regex_compiled.match(filename):
            matches += 1

I believe you can try the same approach for your full program?

Cheers,
Carl Friedrich
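For anyone who wants to try the workaround outside the reproducer, here is a minimal self-contained sketch. The patterns and filenames are made up for illustration (the SSCCE's one_regex() is not shown in this thread), and the sketch only checks that both approaches agree on how many filenames match at least one pattern. It does not measure speed.

```python
import re

# Hypothetical stand-ins for the reproducer's data; the real SSCCE builds
# its patterns with one_regex(), which is not reproduced here.
patterns = [rf"track-{n:04d}-.*\.flac" for n in range(500)]
filenames = [f"track-{n:04d}-title.flac" for n in range(0, 800, 2)]

# Original shape: one compiled regex per pattern, up to n*m match calls.
compiled = [re.compile(p) for p in patterns]
slow_matches = sum(
    1 for name in filenames if any(rx.match(name) for rx in compiled)
)

# Workaround: wrap each pattern in a non-capturing group and join with "|",
# so each filename needs only one match call.
combined = re.compile("|".join(f"(?:{p})" for p in patterns))
fast_matches = sum(1 for name in filenames if combined.match(name))

print(slow_matches, fast_matches)
```

Note that the combined regex only answers "did any pattern match this filename"; it does not say which one.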

On Tue, Mar 16, 2021 at 2:27 AM Carl Friedrich Bolz-Tereick <cfbolz@gmx.de> wrote:
Here's another SSCCE that surprised me a little. I create and del the compiled regexes one at a time, but it's still slow: https://stromberg.dnsalias.org/svn/regex-fodder/trunk/regex-fodder-3
I'm familiar with the technique, as well as that of creating a single, big trie regex. For this application though, I need to check at the end if each regex was matched exactly once, to deter typos causing things to get missed. Thanks much for the suggestion and more! -- Dan Stromberg
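One possible way to keep the matched-exactly-once check while still using a single combined regex is to give each alternative a named group and read the match object's lastgroup. This is only a sketch under assumptions: the patterns and filenames are invented, the patterns are assumed to contain no named groups of their own, and whether this avoids the PyPy slowdown was not tested in this thread. An alternation also reports only the first alternative that matches, so overlapping patterns would be undercounted.

```python
import re
from collections import Counter

# Hypothetical patterns and filenames, for illustration only.
patterns = [r"abba-.*", r"beatles-.*", r"zappa-.*"]
filenames = ["abba-01.flac", "beatles-01.flac", "beatles-02.flac", "other.flac"]

# Wrap each pattern in a named group (p0, p1, ...) so that a successful
# match can report which alternative fired.
combined = re.compile(
    "|".join(f"(?P<p{i}>{p})" for i, p in enumerate(patterns))
)

hits = Counter()
for name in filenames:
    m = combined.match(name)
    if m:
        # lastgroup is the name of the enclosing group that matched, e.g. "p1".
        hits[int(m.lastgroup[1:])] += 1

# Report every pattern that was not matched exactly once.
not_exactly_once = {
    patterns[i]: hits[i] for i in range(len(patterns)) if hits[i] != 1
}
print(not_exactly_once)
```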

Hi folks. I've modified my code to use str.startswith instead of re.match. I had a one-to-one correspondence between filenames and regexes anyway, so it doesn't really sacrifice anything. This way the original app (music-pipeline) is nice and fast now on pypy3 7.3.3. I'm leaving the various SSCCEs at https://stromberg.dnsalias.org/svn/regex-fodder/trunk in case someone wants to use them to replicate the problem going forward. They're commented to describe what they do and whether they are fast or slow. As Carl said, the issue seems to be that pypy3 7.3.3 doesn't like having very many regular expressions in the same program, even if only one compiled regex exists at any given time (no-longer-needed regexes disposed of with del). Thanks again!
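For readers facing the same issue, the str.startswith approach might look roughly like the sketch below. The prefixes and filenames are hypothetical and this is not the actual music-pipeline code; it only illustrates replacing re.match with a prefix test while still checking that each entry was hit exactly once.

```python
from collections import Counter

# Hypothetical blocklist prefixes and filenames, for illustration only.
prefixes = ["abba-", "beatles-", "zappa-"]
filenames = ["abba-01.flac", "beatles-01.flac", "beatles-02.flac", "other.flac"]

# Still an O(n*m) loop, but with plain string prefix tests instead of
# thousands of compiled regular expressions.
hits = Counter()
for name in filenames:
    for prefix in prefixes:
        if name.startswith(prefix):
            hits[prefix] += 1

# Flag any prefix that was not matched exactly once (e.g. because of a typo).
problems = [prefix for prefix in prefixes if hits[prefix] != 1]
print(problems)
```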

Thanks Dan! It's definitely a real problem that you identified. I've filed a bug about it here: https://foss.heptapod.net/pypy/pypy/-/issues/3419
Thanks for the report!
Carl Friedrich
On March 18, 2021 6:32:19 PM GMT+01:00, Dan Stromberg <drsalists@gmail.com> wrote:
