[Python-ideas] Re: Regex timeouts

Feb. 16, 2022

On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull
<stephenjturnbull@gmail.com> wrote:
...
Chris Angelico writes:
...
On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull
<stephenjturnbull@gmail.com> wrote:
...
That is, all regexp implementations support the same basic
language which is sufficient for most tasks most programmers want
regexps for.
The problem is that that's an illusion.
It isn't for me.  I write a lot of regexps for several languages
(Python, grep, sed, Emacs Lisp), I rarely have to debug one, and in a
year there may be one whose debugging requires more than reading each
character out loud and recognizing a typo.
I've used simple regexps in sed and grep, and found differences about
what needs to be escaped, so even when you don't use the advanced
features, you need to be aware of them.
...
But with regexps, I don't, you see.  Besides regexps, I write a lot of
(trivial to simple) parsers.  For the regexps, I don't need much more
than ()[]+*?.\s\d\t\r\n most of the time (and those last 3 are due to
tab-separated value files and RFC 822, special cases).  I could
probably use scanf (except that Python, sed, and Emacs Lisp don't have
it ;-) but for the lack of [].  Occasionally for things like date
parsing and other absolutely fixed-field contexts I'll use {}.  I do
sanity-checking on the result frequently.
Not sure what you mean by "lack of []", but some scanf variants do
support that - for instance, %[a-z] will only match lowercase alpha.
...
If the regexp engine supports it, I'll use named groups and other such
syntactic sugar.  In a general purpose programming language, if it
supports "literate regexps", I use those, if not, I use separate
strings (which also makes it easy to back out of a large regexp into
statements if I need to establish backtracking boundaries).
That's what I mean about the illusion. You can't use named groups in
all regexp engines.
...
"why are other things not ALSO popular". I honestly think that scanf
parsing, if implemented ad-hoc by different programming languages and
extended to their needs, would end up
... becoming regexp-like, and not just in the sense of
...
no less different from each other than different regexp engines are
- the most-used parts would also be the most-compatible, just like
with regexps.
;-)
Heh, probably true :)
...
What I think is more interesting than simpler (but more robust for
what they can do) facilities is better parser support in standard
libraries (not just Python's), and more use of them in place of
hand-written "parsers" that just eat tokens defined by regexps in
order.  If one could, for example, write
[ "Sun|Mon|Tue|Wed|Thu|Fri|Sat" : dow,
  ", ",
  "(?: |\d)\d" : day,
  " ",
  "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec" : month,
  " ",
  "\d\d\d\d" : year,
  " ",
  "\d\d:\d\d:\d\d" : time,
  " ",
  "[+-]\d\d\d\d" : tzoffset ]
(which is not legal Python syntax but I'm too lazy to try to come up
with something better) to parse an RFC 822 date, I think people would
use that.  Sure, for something *that* regular, most people would
probably use the evident "literate" regexp with named groups, but it
wouldn't take much complexity to make such a parser generator
worthwhile to programmers.
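
For comparison, the "literate" regexp with named groups mentioned above might look roughly like this in Python. This is only a sketch: the names RFC822_DATE and parse_date are made up for illustration, and the pattern covers just the fixed-field layout shown above, not the full RFC 822 date grammar.

```python
import re

# Verbose ("literate") pattern for a fixed-field date such as
# "Wed, 16 Feb 2022 21:01:00 +0900". In re.VERBOSE mode plain
# whitespace in the pattern is ignored, so literal spaces are
# written as [ ].
RFC822_DATE = re.compile(
    r"""
    (?P<dow>Sun|Mon|Tue|Wed|Thu|Fri|Sat)
    ,[ ]
    (?P<day>(?:[ ]|\d)\d)
    [ ]
    (?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
    [ ]
    (?P<year>\d\d\d\d)
    [ ]
    (?P<time>\d\d:\d\d:\d\d)
    [ ]
    (?P<tzoffset>[+-]\d\d\d\d)
    """,
    re.VERBOSE,
)


def parse_date(text):
    """Return a dict of the named fields, or None if text does not match."""
    m = RFC822_DATE.fullmatch(text)
    return m.groupdict() if m else None
```

The fields come back as strings, so any sanity-checking (day in range for the month, and so on) would still be a separate step.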
That's an interesting concept. I can imagine writing it declaratively like this:

# Hypothetical syntax: "parser" is an imagined base class, not a real one.
class Date(parser):
    dow: r"Sun|Mon|Tue|Wed|Thu|Fri|Sat"
    _: r", "
    day: r"(?: |\d)\d"
    _: r" "
    month: r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
    _: r" "
    year: r"\d\d\d\d"
    _: r" "
    time: r"\d\d:\d\d:\d\d"
    _: r" "
    tzoffset: r"[+-]\d\d\d\d"

Would it be better than a plain regex? Not sure.
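
One way to experiment with the idea without new syntax is to build the regex from an ordered list of fields. This is a rough sketch, not a proposal: make_parser and parse_date are hypothetical names, and it uses (name, pattern) pairs rather than class annotations because, as far as I can tell, the repeated "_" annotations would overwrite each other in the class's annotations dict.

```python
import re


def make_parser(*fields):
    """Build a parse function from ordered (name, pattern) pairs.

    A name of None marks an unnamed separator. Every pattern is wrapped
    in its own group so that alternations like "Sun|Mon" stay contained.
    """
    parts = []
    for name, pattern in fields:
        if name is None:
            parts.append(f"(?:{pattern})")
        else:
            parts.append(f"(?P<{name}>{pattern})")
    compiled = re.compile("".join(parts))

    def parse(text):
        # Return the named fields as a dict, or None when nothing matches.
        m = compiled.fullmatch(text)
        return m.groupdict() if m else None

    return parse


parse_date = make_parser(
    ("dow", r"Sun|Mon|Tue|Wed|Thu|Fri|Sat"),
    (None, r", "),
    ("day", r"(?: |\d)\d"),
    (None, r" "),
    ("month", r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"),
    (None, r" "),
    ("year", r"\d\d\d\d"),
    (None, r" "),
    ("time", r"\d\d:\d\d:\d\d"),
    (None, r" "),
    ("tzoffset", r"[+-]\d\d\d\d"),
)
```

Under the hood this is still one regex, so it shares whatever backtracking behaviour the joined pattern has; the gain, if any, is in readability.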

ChrisA