
On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Chris Angelico writes:
On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
That is, all regexp implementations support the same basic language which is sufficient for most tasks most programmers want regexps for.
The problem is that that's an illusion.
It isn't for me. I write a lot of regexps for several languages (Python, grep, sed, Emacs Lisp), I rarely have to debug one, and in a year there may be one whose debugging requires more than reading each character out loud and recognizing a typo.
I've used simple regexps in sed and grep, and found differences about what needs to be escaped, so even when you don't use the advanced features, you need to be aware of them.
But with regexps, I don't, you see. Besides regexps, I write a lot of (trivial to simple) parsers. For the regexps, I don't need much more than ()[]+*?.\s\d\t\r\n most of the time (and those last 3 are due to tab-separated value files and RFC 822, special cases). I could probably use scanf (except that Python, sed, and Emacs Lisp don't have it ;-) but for the lack of []. Occasionally for things like date parsing and other absolutely fixed-field contexts I'll use {}. I do sanity-checking on the result frequently.
Not sure what you mean by "lack of []", but some scanf variants do support that - for instance, %[a-z] will only match lowercase alpha.
If the regexp engine supports it, I'll use named groups and other such syntactic sugar. In a general purpose programming language, if it supports "literate regexps", I use those, if not, I use separate strings (which also makes it easy to back out of a large regexp into statements if I need to establish backtracking boundaries).
That's what I mean about the illusion. You can't use named groups in all regexp engines.
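For illustration, here is a minimal sketch of the two styles mentioned above as they look in Python's re module: a "literate" regexp with named groups, and the same pattern assembled from separate strings. The names TIME, HOUR, MINUTE, SECOND and TIME_FROM_PARTS are made up for this example. The (?P<name>...) spelling is Python's; POSIX basic and extended regular expressions do not define named groups at all, which is the portability gap in question.

```python
import re

# "Literate" form: re.VERBOSE ignores unescaped whitespace in the pattern
# and treats everything after an unescaped '#' as a comment.
TIME = re.compile(
    r"""
    (?P<hour>\d\d)   :   # two-digit hour
    (?P<minute>\d\d) :   # two-digit minute
    (?P<second>\d\d)     # two-digit second
    """,
    re.VERBOSE,
)

# Separate-strings form: each piece can be tested on its own, and the
# big pattern can be split back into statements if that becomes necessary.
HOUR = r"(?P<hour>\d\d)"
MINUTE = r"(?P<minute>\d\d)"
SECOND = r"(?P<second>\d\d)"
TIME_FROM_PARTS = re.compile(":".join([HOUR, MINUTE, SECOND]))

fields = TIME.fullmatch("21:01:00").groupdict()
same_fields = TIME_FROM_PARTS.fullmatch("21:01:00").groupdict()
```

Both compiled patterns accept the same strings and expose the same group names; only the way the source is laid out differs.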
"why are other things not ALSO popular". I honestly think that scanf parsing, if implemented ad-hoc by different programming languages and extended to their needs, would end up
... becoming regexp-like, and not just in the sense of
no less different from each other than different regexp engines are - the most-used parts would also be the most-compatible, just like with regexps.
;-)
Heh, probably true :)
What I think is more interesting than simpler (but more robust for what they can do) facilities is better parser support in standard libraries (not just Python's), and more use of them in place of hand-written "parsers" that just eat tokens defined by regexps in order. If one could, for example, write
    [ "Sun|Mon|Tue|Wed|Thu|Fri|Sat" : dow,
      ", ",
      "(?: |\d)\d" : day,
      " ",
      "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec" : month,
      " ",
      "\d\d\d\d" : year,
      " ",
      "\d\d:\d\d:\d\d" : time,
      " ",
      "[+-]\d\d\d\d" : tzoffset ]
(which is not legal Python syntax but I'm too lazy to try to come up with something better) to parse an RFC 822 date, I think people would use that. Sure, for something *that* regular, most people would probably use the evident "literate" regexp with named groups, but it wouldn't take much complexity to make such a parser generator worthwhile to programmers.
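As a rough sketch of how close plain Python can get to that notation today, the field list can be written as (pattern, name) pairs and compiled into one regexp with named groups. The helper compile_fields and the name RFC822_DATE are hypothetical, not part of any existing library, and the sample date string is made up for the example.

```python
import re

def compile_fields(spec):
    """Build one regexp from (pattern, name) pairs.

    Hypothetical helper for illustration only. A name of None marks a
    literal separator, which is escaped and matched verbatim.
    """
    parts = []
    for pattern, name in spec:
        if name is None:
            parts.append(re.escape(pattern))
        else:
            parts.append(f"(?P<{name}>{pattern})")
    return re.compile("".join(parts))

RFC822_DATE = compile_fields([
    ("Sun|Mon|Tue|Wed|Thu|Fri|Sat", "dow"),
    (", ", None),
    (r"(?: |\d)\d", "day"),
    (" ", None),
    ("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", "month"),
    (" ", None),
    (r"\d\d\d\d", "year"),
    (" ", None),
    (r"\d\d:\d\d:\d\d", "time"),
    (" ", None),
    (r"[+-]\d\d\d\d", "tzoffset"),
])

# Made-up sample input in the fixed-field layout described above.
result = RFC822_DATE.fullmatch("Wed, 16 Feb 2022 21:01:00 +0900").groupdict()
```

This is still just a regexp underneath, so it does no more validation than the pattern itself: "Wed, 99 Feb 2022 ..." would match, and sanity-checking the result remains the caller's job.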
That's an interesting concept. I can imagine writing it declaratively like this:

    class Date(parser):
        dow: "Sun|Mon|Tue|Wed|Thu|Fri|Sat"
        _: ", "
        day: "(?: |\d)\d"
        _: " "
        month: "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
        _: " "
        year: "\d\d\d\d"
        _: " "
        time: "\d\d:\d\d:\d\d"
        _: " "
        tzoffset: "[+-]\d\d\d\d"

Would it be better than a plain regex? Not sure.

ChrisA