[Python-ideas] Re: Regex timeouts

Feb. 15, 2022

On Tue, 15 Feb 2022 at 11:47, Steven D'Aprano <steve@pearwood.info> wrote:
...
...
Another 20% will write ".*?X", with scant understanding that it may
extend beyond _just_ "the next" X in some cases.
But this surprises me. Do you have an example?
Nongreedy means it'll prefer the next X, but it has to be open to
checking others.
...
...
...
re.search("a.*?X[^X]*?Y", "zzzabbbXcccXdddYzzz")
<re.Match object; span=(3, 16), match='abbbXcccXdddY'>
The X between bbb and ccc won't result in a match, so the .*? has to
capture more.
...
...
That leaves the happy
5% who write "[^X]*X", which finally says what they intended from the
start.
Doesn't that only work if X is literally a single character?
Yes, but if X is actually "spam", then you can probably do other
assertions to guarantee the right match. It gets pretty clunky though.
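One way to read "other assertions" (a sketch, not necessarily what was
meant) is a negative lookahead in front of every character, so the
repeated part can never step over an occurrence of "spam". The sample
strings here are made up for illustration:

```python
import re

text = "one spam, two spam!"

# Non-greedy dot: free to run past the first "spam" to make the match work.
lazy = re.match(r".*?spam!", text)

# Lookahead-guarded dot: each character is consumed only if "spam" does
# not start there, so the match must end at the FIRST "spam" or fail.
guarded = re.match(r"(?:(?!spam).)*spam!", text)

print(lazy.group())   # the whole string
print(guarded)        # None: the first "spam" is followed by ",", not "!"

# On its own, the guarded form stops at the first "spam".
first = re.search(r"(?:(?!spam).)*spam", "This is some spam and extra spam.")
print(first.group())
```

It does say what was intended, but the pattern is a lot harder to read
than "[^X]*X".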
...
...
...
...
import re
string = "This is some spam and extra spam."
re.search('[^spam]*spam', string)
<re.Match object; span=(11, 17), match='e spam'>
Whereas this seems to do what I expected:
...
...
...
re.search('.*?spam', string)
<re.Match object; span=(0, 17), match='This is some spam'>
Yes, and for a simple example like this, that's fine, as long as all
you care about is whether it matches. But it's far from efficient in
more complex cases, since the engine could potentially have to check
much deeper into the string.

I'm not familiar with SNOBOL, but one thing I am familiar with is C's
scanf (or sscanf etc), the counterpart to printf, which in turn is the
basis for Python's percent formatting. REXX has a PARSE command;
different syntax, similar limitations. Either way, it's a
zero-backtracking parser that can do a wide variety of left-to-right
scanning. That's a useful limitation for a number of cases, and it
only rarely gets in the way. Here's an example of how it might work in
Python:

prefix, spam, eggs = sscanf(
    "There's lots of spam and 12 eggs here, that's lots of eggs",
    "%s lots of %s and %d eggs")

The rules are pretty simple: divide up the format into tokens - either
a percent marker or a literal string like " lots of " - and match them
in sequence. A "%s" can match any string, but only up to the first
place where the next token matches; so the initial "%s" cannot possibly
run as far as the "lots of" near the end of the string - the parser
won't even consider it.
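
Those rules are small enough to sketch in Python. The sscanf below is
hypothetical (there is no such function in the standard library). It
handles only %s and %d, and assumes every %s is either followed by a
literal or ends the format:

```python
import re

_INT = re.compile(r"[+-]?[0-9]+")


def sscanf(text, fmt):
    """Hypothetical left-to-right scanner supporting only %s and %d."""
    # Tokens are either a percent marker or a literal string.
    tokens = [t for t in re.split(r"(%[sd])", fmt) if t]
    pos, out = 0, []
    for i, tok in enumerate(tokens):
        if tok == "%d":
            m = _INT.match(text, pos)
            if m is None:
                raise ValueError(f"expected an integer at offset {pos}")
            out.append(int(m.group()))
            pos = m.end()
        elif tok == "%s":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is None:
                end = len(text)  # last token: take the rest
            elif nxt in ("%s", "%d"):
                raise ValueError("%s must be followed by a literal in this sketch")
            else:
                # Stop at the FIRST occurrence of the next literal;
                # later occurrences are never considered.
                end = text.find(nxt, pos)
                if end < 0:
                    raise ValueError(f"{nxt!r} not found after offset {pos}")
            out.append(text[pos:end])
            pos = end
        else:
            # Literal: must appear exactly at the current position.
            if not text.startswith(tok, pos):
                raise ValueError(f"expected {tok!r} at offset {pos}")
            pos += len(tok)
    return tuple(out)


prefix, spam, eggs = sscanf(
    "There's lots of spam and 12 eggs here, that's lots of eggs",
    "%s lots of %s and %d eggs")
print(prefix, spam, eggs)
```

The position only ever moves forward, so a failed match is reported
immediately instead of being retried from somewhere else.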

Toss in a few variants like "%[a-z]" which can match any sequence of
the characters a-z, "%4s" which must match precisely four characters,
and "%*s" which matches without returning the value, and you can do a
lot of parsing without ever worrying about exponential parsing time.
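
A rough sketch of how those variants could be handled, again with
made-up names (sscanf_ext is hypothetical). The assumptions here are
mine: "%[...]" must match at least one character, a width is accepted
only on %s and means exactly that many characters, and a plain %s must
be followed by a literal or end the format:

```python
import re

# Hypothetical directive syntax: optional '*', optional width, then a type.
_DIRECTIVE = re.compile(r"%(\*?)(\d*)(s|d|\[[^\]]+\])")
_INT = re.compile(r"[+-]?[0-9]+")


def _tokenize(fmt):
    """Split fmt into literal strings and directive match objects."""
    tokens, last = [], 0
    for m in _DIRECTIVE.finditer(fmt):
        if m.start() > last:
            tokens.append(fmt[last:m.start()])
        tokens.append(m)
        last = m.end()
    if last < len(fmt):
        tokens.append(fmt[last:])
    return tokens


def sscanf_ext(text, fmt):
    tokens = _tokenize(fmt)
    pos, out = 0, []
    for i, tok in enumerate(tokens):
        if isinstance(tok, str):
            # Literal: must appear exactly at the current position.
            if not text.startswith(tok, pos):
                raise ValueError(f"expected {tok!r} at offset {pos}")
            pos += len(tok)
            continue
        suppress, width, kind = tok.group(1) == "*", tok.group(2), tok.group(3)
        if width and kind != "s":
            raise ValueError("a width is only supported on %s in this sketch")
        if kind == "d":
            m = _INT.match(text, pos)
            if m is None:
                raise ValueError(f"expected an integer at offset {pos}")
            value, end = int(m.group()), m.end()
        elif kind.startswith("["):
            # "%[a-z]": longest run of characters from the set.
            m = re.compile(kind + "+").match(text, pos)
            if m is None:
                raise ValueError(f"expected {kind} at offset {pos}")
            value, end = m.group(), m.end()
        elif width:
            # "%4s": exactly four characters, whatever they are.
            end = pos + int(width)
            if end > len(text):
                raise ValueError(f"fewer than {width} characters left")
            value = text[pos:end]
        else:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is None:
                end = len(text)
            elif isinstance(nxt, str):
                end = text.find(nxt, pos)
                if end < 0:
                    raise ValueError(f"{nxt!r} not found after offset {pos}")
            else:
                raise ValueError("plain %s must be followed by a literal")
            value = text[pos:end]
        if not suppress:  # "%*s" matches but returns nothing
            out.append(value)
        pos = end
    return tuple(out)


print(sscanf_ext("id=abc42 code=WXYZrest", "id=%[a-z]%d code=%4s%*s"))
```

Every directive either consumes a run at the current position or
jumps to the first occurrence of the next literal, so there's still
nothing to backtrack over.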

The REXX equivalent would be:

data = "There's lots of spam and 12 eggs here, that's lots of eggs"
parse var data prefix " lots of " spam " and " eggs " eggs"

but it's been a very very long time since I did any advanced REXX
parsing, so I can't remember all the details of what it's capable of.
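
For literal separators like these, a rough Python counterpart is a
chain of str.partition calls. It's only an approximation of the PARSE
line above: every piece stays a string (so eggs is "12", not 12), and
whatever follows the last separator is simply left over in rest:

```python
data = "There's lots of spam and 12 eggs here, that's lots of eggs"

# Each partition splits at the FIRST occurrence of the separator and
# carries the remainder forward, so the scan is strictly left to right.
prefix, _, rest = data.partition(" lots of ")
spam, _, rest = rest.partition(" and ")
eggs, _, rest = rest.partition(" eggs")

print(prefix, spam, eggs)
```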

ChrisA

Chris Angelico