
[Tim]
In SNOBOL, as I recall, it could be spelled
ARB "spam" FENCE
[Chris]
Ah, so that's a bit more complicated than the "no-backtracking" parsing style of REXX and scanf.
[Tim]
Oh, a lot more complex. In SNOBOL, arbitrary computation can be performed at any point in pattern matching. Yet _simple_ things are simple to say. Going back to the original one-letter "find the leftmost 'x' after the current point", it's just

    BREAK('x') 'x'

BREAK is a builtin function that returns a pattern that consumes all the characters up to the first occurrence of one of the letters in its argument. If it's backed up into via backtracking, it fails: the shortest string is the only one it will consume.

If you want something fancier, fine - you can code anything you like. For example, here's a pattern object (BAL) that matches the longest non-empty substring balanced with respect to parentheses:

    BALEXP = NOTANY('()') | '(' ARBNO(*BALEXP) ')'
    BAL = BALEXP ARBNO(BALEXP)

ARBNO is akin to a regexp's *? quantifier (zero or more occurrences, fewest first). The '*' in front of BALEXP in the first line tells SNOBOL to delay evaluating BALEXP until the pattern runs; at the time BALEXP is _defined_ by that line, BALEXP isn't yet bound to anything useful.
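For comparison, here's a rough Python sketch of the "balanced" idea BAL captures. It doesn't model SNOBOL's lazy backtracking at all; it just returns the longest non-empty parenthesis-balanced span starting at index i (the name longest_bal is made up for illustration):

    def longest_bal(s, i):
        # Track parenthesis depth; every point where the running depth is 0
        # ends a balanced span, and an unmatched ')' ends any hope of
        # extending further to the right.
        depth = 0
        best = None
        for j in range(i, len(s)):
            c = s[j]
            if c == '(':
                depth += 1
            elif c == ')':
                depth -= 1
                if depth < 0:
                    break
            if depth == 0:
                best = j + 1
        return s[i:best] if best is not None else None

    longest_bal("a(b(c)d)e)f", 0)   # -> 'a(b(c)d)e'  (stops before the stray ')')
    longest_bal(")x", 0)            # -> None         (no non-empty balanced prefix)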
[Chris]
Does this guarantee anything about performance, or is it primarily to allow you to write patterns that can't mismatch strings?
[Tim]
They were inventing "pattern matching" in the 1960s, and were looking for expressive notations that were easy for experts to use. Regular expressions were much more of theoretical interest at the time, and offered far feebler abilities than SNOBOL's pattern abilities anyway. It's still interesting to me that they settled on a relatively "wordy" notation, which is pretty easy to pick up. This is perhaps because regexps are so feeble in comparison that they allowed proving mountains of properties, so a hyper-concise notation for them was invented to make publishing long-winded proofs possible ;-)

But while SNOBOL was wildly inventive for its time, it was always intended to be used "to get real work done". Note too that potential backtracking isn't hiding in notation: it's quite explicit in names like ARB and ARBNO, and, of course, in pattern alternation (the infix binary "|" operator). Most builtin pattern functions and objects in SNOBOL are "one and done" (they match or they don't, and fail if backtracked into). Want a string of digits? SPAN('0123456789'). Like a regexp's \d+, but will _only_ match the longest string of digits starting at the current point.

The possibility for backtracking at the _lexical_ level is almost always a misfeature. Which didn't matter in the seminal lex+yacc scheme: yacc consumed the tokens lex's regexps produced, and if yacc failed to match the grammar, lex didn't go on to try a different way of tokenizing. In effect, \d+ in lex acted like SNOBOL's SPAN. regexps didn't manage the transition from "a tool" to "a solution" gracefully ;-)
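As a minimal sketch of that "one and done" behavior in regexp terms: Python 3.11 added possessive quantifiers and atomic groups to re, which behave like SPAN in that they never give back what they consumed (the example string is made up for illustration; this needs 3.11+):

    import re

    s = "123456"
    re.fullmatch(r"\d+\d", s)    # matches: \d+ gives a digit back when backtracked into
    re.fullmatch(r"\d++\d", s)   # None: the possessive \d++ keeps every digit, like SPAN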