Chris Angelico writes:
 > On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
 > > That is, all regexp implementations support the same basic language
 > > which is sufficient for most tasks most programmers want regexps for.
 >
 > The problem is that that's an illusion.
It isn't for me. I write a lot of regexps in several languages (Python, grep, sed, Emacs Lisp), I rarely have to debug one, and in a year there may be one whose debugging requires more than reading each character out loud and recognizing a typo. As a sometime Emacsen dev, I also do a fair amount of debugging of other people's regexps. Yuck! But it's almost always the case that (modulo efficiency considerations) it's pretty easy to figure out what they *want*, and to rewrite the code (*not* the *regexp(s)*!) to use simpler regexps (usually parts of the original) in a more controlled way.
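To make the rewrite-the-code-not-the-regexp approach concrete, here is a minimal hypothetical sketch (the config-line task and both helper names are my invention, not from the thread): rather than one regexp that handles the comment and the assignment at once, each simple regexp is applied in its own controlled step.

```python
import re

# Two simple regexps used in sequence, instead of one complex one.
COMMENT = re.compile(r"\s*#.*$")            # trailing comment
ASSIGN = re.compile(r"^(\w+)\s*=\s*(.*)$")  # key = value

def parse_config_line(line):
    """Return (key, value), or None for blank or comment-only lines."""
    line = COMMENT.sub("", line).strip()    # step 1: drop any comment
    if not line:
        return None
    m = ASSIGN.match(line)                  # step 2: match the assignment
    if m is None:
        raise ValueError("not an assignment: %r" % line)
    return m.group(1), m.group(2)

# parse_config_line("timeout = 30  # seconds") → ("timeout", "30")
```

Each step is trivially readable on its own, and a failure points at the step that failed rather than somewhere inside a large pattern.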
 > If you restrict yourself to the subset that's supported by every
 > regexp implementation, you'll quickly find tasks that you can't handle.
That's true of everything in programming: no tool can handle everything until you have a Turing-complete programming language, and even then there are practically always things that are too painful even for masochists to do in that language. But with regexps, I don't run into that wall, you see.

Besides regexps, I write a lot of (trivial to simple) parsers. For the regexps, I don't need much more than ()[]+*?.\s\d\t\r\n most of the time (and those last three are due to tab-separated-value files and RFC 822, special cases). I could probably use scanf (except that Python, sed, and Emacs Lisp don't have it ;-) but for the lack of []. Occasionally, for things like date parsing and other absolutely fixed-field contexts, I'll use {}. I do sanity-checking on the result frequently. If the regexp engine supports it, I'll use named groups and other such syntactic sugar. In a general-purpose programming language, if it supports "literate regexps" I use those; if not, I use separate strings (which also makes it easy to back out of a large regexp into statements if I need to establish backtracking boundaries).

Sure, if you want to do all of that *in* a single regexp, you are indeed going to run into things you can't do that way. When I wrote "what people want to do" I meant tasks where regexps could do a lot of the task, but not that they could do the whole thing in one regexp. For my style, regexps are something that's available in a very wide array of contexts, and consistent enough to get the job done. I treat complex regexps the way I treat C extensions: only if the performance benefit is human-perceptible, which is pretty rare.
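In Python, the "literate regexp" and separate-strings styles mentioned above can be combined; this is a sketch with an invented ISO-ish timestamp example (the pattern names DATE, TIME, and TIMESTAMP are mine, not from the thread), using re.VERBOSE and named groups:

```python
import re

# Pieces kept as separate strings, so any group can later be backed out
# into its own statement if a backtracking boundary is needed.
DATE = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
TIME = r"(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})"

# re.VERBOSE ignores the layout whitespace and the # comments,
# but not whitespace inside a character class like [ T].
TIMESTAMP = re.compile(rf"""
    ^ {DATE}        # calendar date
    [ T] {TIME}     # time of day, separated by space or "T"
    $""", re.VERBOSE)

m = TIMESTAMP.match("2022-02-16 01:54:00")
# m.group("year") → "2022", m.group("hour") → "01"
```

The named groups serve as in-pattern documentation, and composing the pattern from separate strings keeps each piece simple enough to read character by character.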
"why are other things not ALSO popular". I honestly think that scanf parsing, if implemented ad-hoc by different programming languages and extended to their needs, would end up
... becoming regexp-like, and not just in the sense of
no less different from each other than different regexp engines are - the most-used parts would also be the most-compatible, just like with regexps.
;-) What I think is more interesting than simpler (but more robust for what they can do) facilities is better parser support in standard libraries (not just Python's), and more use of them in place of hand-written "parsers" that just eat tokens defined by regexps in order. If one could, for example, write

    [ "Sun|Mon|Tue|Wed|Thu|Fri|Sat" : dow, ", ",
      "(?: |\d)\d" : day, " ",
      "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec" : month, " ",
      "\d\d\d\d" : year, " ",
      "\d\d:\d\d:\d\d" : time, " ",
      "[+-]\d\d\d\d" : tzoffset ]

(which is not legal Python syntax, but I'm too lazy to try to come up with something better) to parse an RFC 822 date, I think people would use that. Sure, for something *that* regular, most people would probably use the evident "literate" regexp with named groups, but it wouldn't take much complexity to make such a parser generator worthwhile to programmers.

Steve
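One possible realization of that parser-generator idea in today's Python, as a hedged sketch: the spec becomes a list of (regexp, name) pairs, with None marking pure separators. The function name seq_parse is an invention for illustration.

```python
import re

def seq_parse(spec, text):
    """Match each (pattern, name) pair in order; collect named fields."""
    pos = 0
    out = {}
    for pattern, name in spec:
        m = re.compile(pattern).match(text, pos)
        if m is None:
            raise ValueError(f"expected {pattern!r} at position {pos}")
        if name is not None:
            out[name] = m.group(0)
        pos = m.end()
    if pos != len(text):
        raise ValueError(f"trailing text at position {pos}")
    return out

# The RFC 822 date spec from the message above, in this notation.
RFC822_DATE = [
    (r"Sun|Mon|Tue|Wed|Thu|Fri|Sat", "dow"),
    (r", ", None),
    (r"(?: |\d)\d", "day"),
    (r" ", None),
    (r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec", "month"),
    (r" ", None),
    (r"\d\d\d\d", "year"),
    (r" ", None),
    (r"\d\d:\d\d:\d\d", "time"),
    (r" ", None),
    (r"[+-]\d\d\d\d", "tzoffset"),
]

fields = seq_parse(RFC822_DATE, "Wed, 16 Feb 2022 01:54:00 +0900")
# fields["dow"] → "Wed", fields["tzoffset"] → "+0900"
```

Because each field's regexp is anchored at the current position and matched in sequence, errors are reported at the exact position a field failed, which is exactly the kind of controlled use of simple regexps described earlier in the message.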