<div dir="ltr"><div dir="ltr"><br></div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 16, 2021 at 2:27 AM Carl Friedrich Bolz-Tereick <<a href="mailto:cfbolz@gmx.de">cfbolz@gmx.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 3/15/21 11:16 PM, Dan Stromberg wrote:<br>

><br>

> And it's opensource, though many of the inputs are licensed.<br>

><br>

> The code is at <a href="https://stromberg.dnsalias.org/~strombrg/music-pipeline/" rel="noreferrer" target="_blank">https://stromberg.dnsalias.org/~strombrg/music-pipeline/</a><br>

> <<a href="https://stromberg.dnsalias.org/~strombrg/music-pipeline/" rel="noreferrer" target="_blank">https://stromberg.dnsalias.org/~strombrg/music-pipeline/</a>><br>

> (<a href="https://stromberg.dnsalias.org/svn/music-pipeline/trunk/" rel="noreferrer" target="_blank">https://stromberg.dnsalias.org/svn/music-pipeline/trunk/</a><br>

> <<a href="https://stromberg.dnsalias.org/svn/music-pipeline/trunk/" rel="noreferrer" target="_blank">https://stromberg.dnsalias.org/svn/music-pipeline/trunk/</a>>)<br>

><br>

> It appears to be more than 10x slower.<br>

><br>

> I haven't profiled it yet.  I believe it's probably the "Blocklisting<br>

> files..." part that's slow.  That part is O(n*m), and seems to take<br>

> forever.  It's heavy on regular expressions.<br>

><br>

> Are regular expressions expected to be slow on Pypy3?<br>

<br>

Hi Dan,<br>

<br>

Interesting problem! single regular expressions are reasonably fast on<br>

PyPy, being jitted. But I don't think we looked into the problem of<br>

"what if you have thousands of them" before. Your reproducer is hitting<br>

a kind of known, hard to fix corner case of the JIT, it's actually<br>

producing a linear search over the existing regular expressions for<br>

every match call in this case, with catastrophic consequences. It's on<br>

my mid-term plans to work on this problem, but not next week.<br></blockquote><div><br></div><div>Here's another SSCCE that surprised me a little.  I create and del the compiled regexes one at a time, but it's still slow:<br><div><a href="https://stromberg.dnsalias.org/svn/regex-fodder/trunk/regex-fodder-3">https://stromberg.dnsalias.org/svn/regex-fodder/trunk/regex-fodder-3</a><br></div><div></div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

Here's a fun workaround, that improves the performance of both CPython<br>

(by about 2x for me) and pypy (by 10x or so): turn the many regular<br>

expressions into a single one:<br>

<br>

     regex_strings = [f"(?:{one_regex()})" for repno in range(2_046)]<br>

     regex_compiled = re.compile("|".join(regex_strings))<br>

<br>

then you replace the match calls with a single one:<br>

<br>

     for filename in filenames:<br>

         if regex_compiled.match(filename):<br>

             matches += 1<br>

<br>

I believe you can try the same approach for your full program?<br>

</blockquote></div><div><br></div><div>I'm familiar with the technique, as well as that of creating a single, big trie regex.  For this application though, I need to check at the end if each regex was matched exactly once, to deter typos causing things to get missed.  Thanks much for the suggestion and more!</div><div><br></div>-- <br><div dir="ltr" class="gmail_signature">Dan Stromberg</div></div>