Chris Angelico writes:
 > On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
 > > That is, all regexp implementations support the same basic language
 > > which is sufficient for most tasks most programmers want regexps for.
 >
 > The problem is that that's an illusion.
It isn't for me. I write a lot of regexps in several languages (Python, grep, sed, Emacs Lisp), I rarely have to debug one, and in a year there may be one whose debugging requires more than reading each character out loud and recognizing a typo. As a sometime Emacsen dev, I also do a fair amount of debugging of other people's regexps. Yuck! But it's almost always the case that (modulo efficiency considerations) it's pretty easy to figure out what they *want*, and to rewrite the code (*not* the *regexp(s)*!) to use simpler regexps (usually parts of the original) in a more controlled way.
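To make the rewrite-the-code-not-the-regexp approach concrete, here is a minimal hypothetical sketch (the config-line task and both helper names are my invention, not from the thread): rather than one regexp that handles the comment and the assignment at once, each simple regexp is applied in its own controlled step.

```python
import re

# Two simple regexps used in sequence, instead of one complex one.
COMMENT = re.compile(r"\s*#.*$")            # trailing comment
ASSIGN = re.compile(r"^(\w+)\s*=\s*(.*)$")  # key = value

def parse_config_line(line):
    """Return (key, value), or None for blank or comment-only lines."""
    line = COMMENT.sub("", line).strip()    # step 1: drop any comment
    if not line:
        return None
    m = ASSIGN.match(line)                  # step 2: match the assignment
    if m is None:
        raise ValueError("not an assignment: %r" % line)
    return m.group(1), m.group(2)

# parse_config_line("timeout = 30  # seconds") → ("timeout", "30")
```

Each step is trivially readable on its own, and a failure points at the step that failed rather than somewhere inside a large pattern.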
 > If you restrict yourself to the subset that's supported by every
 > regexp implementation, you'll quickly find tasks that you can't handle.
That's true of everything in programming: no tool can handle everything until you have a Turing-complete programming language, and even then there are practically always things that are too painful even for masochists to do in that language. But with regexps, I don't run into that wall, you see.

Besides regexps, I write a lot of (trivial to simple) parsers. For the regexps, I don't need much more than ()[]+*?.\s\d\t\r\n most of the time (and those last three are due to tab-separated-value files and RFC 822, special cases). I could probably use scanf (except that Python, sed, and Emacs Lisp don't have it ;-) but for the lack of []. Occasionally, for things like date parsing and other absolutely fixed-field contexts, I'll use {}. I do sanity-checking on the result frequently. If the regexp engine supports it, I'll use named groups and other such syntactic sugar. In a general-purpose programming language, if it supports "literate regexps" I use those; if not, I use separate strings (which also makes it easy to back out of a large regexp into statements if I need to establish backtracking boundaries).

Sure, if you want to do all of that *in* a single regexp, you are indeed going to run into things you can't do that way. When I wrote "what people want to do" I meant tasks where regexps could do a lot of the task, but not that they could do the whole thing in one regexp. For my style, regexps are something that's available in a very wide array of contexts, and consistent enough to get the job done. I treat complex regexps the way I treat C extensions: only if the performance benefit is human-perceptible, which is pretty rare.
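In Python, the "literate regexp" and separate-strings styles mentioned above can be combined; this is a sketch with an invented ISO-ish timestamp example (the pattern names DATE, TIME, and TIMESTAMP are mine, not from the thread), using re.VERBOSE and named groups:

```python
import re

# Pieces kept as separate strings, so any group can later be backed out
# into its own statement if a backtracking boundary is needed.
DATE = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
TIME = r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})"

# re.VERBOSE ignores the layout whitespace and the # comments,
# but not whitespace inside a character class like [ T].
TIMESTAMP = re.compile(rf"""
    ^ {DATE}        # calendar date
    [ T] {TIME}     # time of day, separated by space or "T"
    $""", re.VERBOSE)

m = TIMESTAMP.match("2022-02-16 01:54:00")
# m.group("year") → "2022", m.group("hour") → "01"
```

The named groups serve as in-pattern documentation, and composing the pattern from separate strings keeps each piece simple enough to read character by character.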
"why are other things not ALSO popular". I honestly think that scanf parsing, if implemented ad-hoc by different programming languages and extended to their needs, would end up
... becoming regexp-like, and not just in the sense of
no less different from each other than different regexp engines are - the most-used parts would also be the most-compatible, just like with regexps.
;-) What I think is more interesting than simpler (but more robust for what they can do) facilities is better parser support in standard libraries (not just Python's), and more use of them in place of hand-written "parsers" that just eat tokens defined by regexps in order. If one could, for example, write

    [ "Sun|Mon|Tue|Wed|Thu|Fri|Sat" : dow, ", ",
      "(?: |\d)\d" : day, " ",
      "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec" : month, " ",
      "\d\d\d\d" : year, " ",
      "\d\d:\d\d:\d\d" : time, " ",
      "[+-]\d\d\d\d" : tzoffset ]

(which is not legal Python syntax, but I'm too lazy to try to come up with something better) to parse an RFC 822 date, I think people would use that. Sure, for something *that* regular, most people would probably use the evident "literate" regexp with named groups, but it wouldn't take much complexity to make such a parser generator worthwhile to programmers.

Steve
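One possible realization of that parser-generator idea in today's Python, as a hedged sketch: the spec becomes a list of (regexp, name) pairs, with None marking pure separators. The function name seq_parse is an invention for illustration.

```python
import re

def seq_parse(spec, text):
    """Match each (pattern, name) pair in order; collect named fields."""
    pos = 0
    out = {}
    for pattern, name in spec:
        m = re.compile(pattern).match(text, pos)
        if m is None:
            raise ValueError(f"expected {pattern!r} at position {pos}")
        if name is not None:
            out[name] = m.group(0)
        pos = m.end()
    if pos != len(text):
        raise ValueError(f"trailing text at position {pos}")
    return out

# The RFC 822 date spec from the message above, in this notation.
RFC822_DATE = [
    (r"Sun|Mon|Tue|Wed|Thu|Fri|Sat", "dow"),
    (r", ", None),
    (r"(?: |\d)\d", "day"),
    (r" ", None),
    (r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", "month"),
    (r" ", None),
    (r"\d\d\d\d", "year"),
    (r" ", None),
    (r"\d\d:\d\d:\d\d", "time"),
    (r" ", None),
    (r"[+-]\d\d\d\d", "tzoffset"),
]

fields = seq_parse(RFC822_DATE, "Wed, 16 Feb 2022 01:54:00 +0900")
# fields["dow"] → "Wed", fields["tzoffset"] → "+0900"
```

Because each field's regexp is anchored at the current position and matched in sequence, errors are reported at the exact position a field failed, which is exactly the kind of controlled use of simple regexps described earlier in the message.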