[Python-ideas] Re: Regex timeouts

Feb. 15, 2022

[Tim]
...
...
...
That leaves the happy 5% who write "[^X]*X", which
finally says what they intended from the start.
[Steven]
...
...
Doesn't that only work if X is literally a single character?
Right. It was an example, not a meta-example. Even for a _single
character_, "match up to the next, but never more or less than that"
is a puzzle for most regexp users.
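
A minimal sketch of the single-character case with Python's `re` module. The sample text and the trailing "c" in both patterns are made up for illustration; they only serve to force a failure after the first X, so the difference in backtracking shows up.

```python
import re

text = "aXbXc"  # made-up sample: two X's, only the second is followed by "c"

# ".*?X" prefers the nearest X, but when the rest of the pattern fails
# it backtracks and lets ".*?" run past that X to a later one.
lazy = re.match(r".*?Xc", text)

# "[^X]*" can never step over an X, so the X in the pattern is pinned
# to the first X in the text; the overall match simply fails here.
negated = re.match(r"[^X]*Xc", text)

print(lazy.group() if lazy else None)        # aXbXc
print(negated.group() if negated else None)  # None
```

The lazy form says "nearest X, unless that doesn't work out"; the negated class says "the next X and no other".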

[Chris]
...
Yes, but if X is actually "spam", then you can probably do other
assertions to guarantee the right match. It gets pretty clunky though.
Assertions aren't needed, but it is nightmarish to get right.

(|[^s]|s(|[^p]|p(|[^a]|a(|[^m]))))*spam

The "spam" at the end is the only obvious part ;-)

Before then, we match 0 or more instances of

    nothing
    or not 's'
    or 's' followed by
        nothing
        or not 'p'
        or 'p' followed by
            nothing
            or not 'a'
            or 'a' followed by
                nothing
                or not 'm'

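The idea is that "spam" itself can't get through that maze, so
backtracking into it after its first match can't consume the matched
"spam" to find a later one. As written, though, the "nothing"
alternatives leak: the group can take a lone 's' on one trip and then
'p', 'a' and 'm' on later trips, which is the "nightmarish to get
right" part.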
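
A quick way to probe the claim is to force backtracking with `fullmatch()`, sketched below in Python. The sample text is made up, and the lookahead pattern is an assumed alternative along the lines of the "other assertions" mentioned above, not something from the original post. Because the posted pattern's "nothing" alternatives let the group take 's' alone and then 'p', 'a', 'm' one character per trip, it appears to let the first "spam" through under backtracking, while the lookahead version does not.

```python
import re

# The pattern exactly as posted (assertion-free).
posted = re.compile(r"(|[^s]|s(|[^p]|p(|[^a]|a(|[^m]))))*spam")

# Assumed alternative using a negative lookahead: the group may take any
# non-'s' character, or an 's' that does not begin "spam".
guarded = re.compile(r"(?:[^s]|s(?!pam))*spam")

text = "spam spam"  # made-up sample with two occurrences

# match() returns the first success; for both patterns that ends at
# the first "spam".
first_posted = posted.match(text).group()
first_guarded = guarded.match(text).group()

# fullmatch() rejects that first success (it does not reach the end of
# the text), so the engine has to backtrack into the group.
leaks = posted.fullmatch(text) is not None   # group ate the first "spam"
holds = guarded.fullmatch(text) is None      # group refused to eat it

print(first_posted, first_guarded, leaks, holds)
```

An assertion-free pattern that really holds has to track what happens when an 's' interrupts a partial "sp" or "spa", which is where the clunkiness comes from.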

In SNOBOL, as I recall, it could be spelled

    ARB "spam" FENCE

Those are all pattern objects, and infix whitespace is a binary
pattern catenation operator.

ARB is a builtin pattern that matches the empty string at first, and
extends what it matches by one character each time it's backtracked
into.

"spam" matches the obvious string.

Then FENCE is a builtin pattern that matches an empty string, but acts
as a backtracking barrier: if the overall match attempt fails,
backtracking will not move "to the left" of FENCE. So, here, ARB will
not get a chance to consume more characters after the leftmost "spam"
is found.
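
The ARB and FENCE behaviour described above can be imitated loosely in Python with generators, where "backtracking into" a pattern means asking its generator for the next result. This is only a sketch of the idea, not SNOBOL semantics; every function name here (`arb`, `lit`, `arb_then`, `fenced`, `match_all`) is hypothetical, and so is the trailing "!" used to make the first attempt fail.

```python
def arb(text, pos):
    """Like ARB: offer the empty match first, then one character more
    each time the caller asks again."""
    for end in range(pos, len(text) + 1):
        yield end

def lit(s):
    """Matcher for a literal string: at most one way to match."""
    def match(text, pos):
        if text.startswith(s, pos):
            yield pos + len(s)
    return match

def arb_then(literal, text, pos=0):
    """ARB followed by a literal: every end position, leftmost first."""
    for mid in arb(text, pos):
        yield from lit(literal)(text, mid)

def fenced(alternatives):
    """Like FENCE: keep the first result and drop the rest, so nothing
    to the left can be retried."""
    for end in alternatives:
        yield end
        return

def match_all(text, rest, use_fence):
    """Match ARB "spam" [FENCE] rest against the whole of text."""
    heads = arb_then("spam", text)
    if use_fence:
        heads = fenced(heads)
    for mid in heads:
        for end in lit(rest)(text, mid):
            if end == len(text):
                return True
    return False

sample = "spam and spam!"
print(match_all(sample, "!", use_fence=False))  # True
print(match_all(sample, "!", use_fence=True))   # False
```

Without the fence, ARB is re-entered and stretches to the second "spam"; with it, the attempt fails as soon as "!" does not follow the leftmost one.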

Tim Peters