[Python-ideas] Re: Regex timeouts

Feb. 17, 2022

[J.B. Langston <jblangston@datastax.com>]
...
And unfortunately it does appear that my app took almost a 20%
performance hit from using regex instead of re.
Processing time for a test dataset with 700MB of logs went from
77 seconds with the standard library re to 92 seconds with regex.
Profiling confirms that the time spent in the groupdict method went
from 3.27% to 9.41% and the time spent in match went from
5.19% to 10.33% of the execution time.

I'm mildly surprised! Most times people report that regex is at least
modestly faster than re.
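
For anyone wanting to reproduce that kind of comparison on a small scale, here is a minimal sketch that times match() plus groupdict() under each module. The log pattern, the sample line, and the bench() helper are all made up for illustration (the real app's regexps aren't shown in this thread), and regex is treated as optional since it's a third-party install.

```python
import re
import timeit

try:
    import regex  # third-party ("pip install regex"); optional here
except ImportError:
    regex = None

# Hypothetical log format, invented for this sketch.
PATTERN = (r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
           r"(?P<level>[A-Z]+) (?P<msg>.*)")
LINE = "2022-02-17 12:34:56 INFO compaction finished"

def bench(module, number=20000):
    """Time match() + groupdict() using the given regexp module."""
    compiled = module.compile(PATTERN)

    def work():
        m = compiled.match(LINE)
        return m.groupdict()

    return timeit.timeit(work, number=number)

results = {"re": bench(re)}
if regex is not None:
    results["regex"] = bench(regex)

for name, seconds in results.items():
    print(f"{name}: {seconds:.3f}s")
```

Numbers from a toy like this say nothing about a 700MB workload; profiling the real app, as was done above, is what counts.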

For peak performance at the expense of flexibility (e.g., no
backreferences supported), perhaps you'd get a major speed gain from
an entirely different regexp engine approach. Google built such an
engine, which has worst-case linear matching runtime (but, depending
on details, _may_ require space exponential in the regexp to compile a
regexp object). A Python binding for that is here ("pip install
pyre2"):

https://pypi.org/project/pyre2/

I haven't used it - this isn't a particular area of interest for me.

Note that in its "Performance" section, it shows examples where it
blows re out of the water. But, in all those cases, regex was modestly
to very significantly faster than re too.

YMMV.
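
To make the contrast with "worst-case linear" concrete, here is a small sketch of how a backtracking engine can blow up. It uses only the standard library re, a textbook nested-quantifier pattern (not the pattern from the app in question), and deliberately tiny inputs so it finishes quickly. The names EVIL, SAFE and time_match are made up for the example.

```python
import re
import time

EVIL = re.compile(r"(a+)+$")  # nested quantifier: many ways to split a run of a's
SAFE = re.compile(r"a+$")     # accepts the same strings, without the nesting

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

timings = []
for n in (10, 14, 18):
    # The trailing "b" guarantees failure, so a backtracking engine
    # ends up trying every way of carving up the leading a's.
    text = "a" * n + "b"
    evil_result, evil_secs = time_match(EVIL, text)
    safe_result, safe_secs = time_match(SAFE, text)
    timings.append((n, evil_result, safe_result))
    print(f"n={n:2d}  nested: {evil_secs:.4f}s  plain: {safe_secs:.6f}s")
```

With a backtracking matcher, the time for the nested pattern is expected to grow roughly geometrically with n, while the plain one stays negligible. Push n much higher and the nested case effectively hangs, which is the failure mode an automaton-based engine rules out.
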
...
So this means switching to regex is probably a no go.

The difference between 77 and 92 seconds doesn't, on the face of it,
scream "disaster" to me - but suit yourself.
...
If hanging regexes become a common occurrence for my app I might decide
it's worth the performance hit in the name of safety, but at this point
I would rather not.

If they were destined to become a common occurrence, they already
would have done so. You blamed your bad case on unexpected data, but
the _actual_ cause was an unintended typo in one of your regexps.

So it goes. You're using a tool with a hyper-concise notation, where a
correct expression is pretty much indistinguishable from line noise,
and a typo is rarely detectable as a syntax error.

So you're learning the hard way that you have to be on crisis-level
alert when writing regexps: they're extremely touchy and unforgiving.

pyre2 would spare you from all match-time timing disasters, but
"touchy and unforgiving" applies all the same. Instead of a typo
causing exponential runtime, it may instead cause the regexp to match
(or fail to match) in unintended ways. About which no clue of any kind
will be left behind, unless you stare at the inputs and outputs and
check them yourself. But, in that case, 92 seconds wouldn't even get
you through 1000 bytes ;-)
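
One cheap way to do that checking without staring is a small table of sample inputs and expected outputs, run against each regexp. The sketch below is only an illustration: the log format, LOG_LINE, TYPO, CASES and check() are all invented here, and the "typo" is a misspelled alternative that compiles without complaint.

```python
import re

# Hypothetical log-line pattern, invented for this sketch.
LOG_LINE = re.compile(
    r"(?P<level>INFO|WARN|ERROR) \[(?P<thread>[^\]]+)\] (?P<msg>.+)")

# Same pattern with a one-letter typo ("EROR"); it still compiles fine.
TYPO = re.compile(
    r"(?P<level>INFO|WARN|EROR) \[(?P<thread>[^\]]+)\] (?P<msg>.+)")

# (input line, expected groupdict or None when no match is expected)
CASES = [
    ("INFO [main] started",
     {"level": "INFO", "thread": "main", "msg": "started"}),
    ("ERROR [gc-1] pause too long",
     {"level": "ERROR", "thread": "gc-1", "msg": "pause too long"}),
    ("DEBUG [main] ignored", None),
    ("INFO main started", None),
]

def check(pattern, cases):
    """Return a list of (text, expected, actual) for every mismatch."""
    failures = []
    for text, expected in cases:
        m = pattern.fullmatch(text)
        actual = m.groupdict() if m else None
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

good_failures = check(LOG_LINE, CASES)
typo_failures = check(TYPO, CASES)
print(len(good_failures), len(typo_failures))
```

The typo'd pattern raises no error anywhere; only the ERROR sample in the table reveals that it silently stopped matching. A table like this is only as good as its samples, of course.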

Tim Peters