Hello,

I had opened this bug because I had a bad regex in my code that was causing python to hang in the regex evaluation: https://bugs.python.org/issue46627. I have fixed the problem with that specific regex, but more generally I think it would be good to have a timeout option that could be configurable when compiling the regex so that if the regex didn't complete within the specified timeframe, it would abort and throw an exception.

The bug was closed as Won't Fix and it was suggested that I bring up my idea here. The suggestion made to me on the bug was to read Mastering Regular Expressions and get better at writing regexes. I will take this advice, but this isn't really a reasonable solution to my problem for a few reasons. My use case is log parsing and I have a large number of regexes that run over many different log lines. With the volume of regexes I have, it's hard to make sure every regex has no potential problems, especially when the pathological behavior only occurs on certain inputs that may not have been anticipated when developing the regex. Also because of the volume of data these regexes are parsing, I would never want to allow a regex to run longer than a few milliseconds because if it did, that would kill my log processing throughput. I'd rather that it just raise an exception and move on to the next log entry.

Thanks,
J.B.
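To make the request concrete, the kind of interface being asked for might look something like this. This is purely a hypothetical sketch: neither the timeout argument nor the exception below exists in today's re module.

import re

# Hypothetical sketch only: timeout= and re.TimeoutError do not exist in re today;
# this just illustrates the requested behaviour.
log_line = "2022-02-14 12:00:00 INFO something happened"
pattern = re.compile(r"(?P<level>INFO|WARN|ERROR) (?P<msg>.+)", timeout=0.005)  # abort any match after ~5 ms
try:
    match = pattern.search(log_line)
except re.TimeoutError:                # hypothetical exception
    match = None                       # skip this entry and keep the pipeline moving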
For what it's worth, the "regex" library on PyPI (not "re") supports timeouts: https://pypi.org/project/regex/ On Mon, Feb 14, 2022, 6:54 PM J.B. Langston <jblangston@datastax.com> wrote:
Hello,
I had opened this bug because I had a bad regex in my code that was causing python to hang in the regex evaluation: https://bugs.python.org/issue46627. I have fixed the problem with that specific regex, but more generally I think it would be good to have a timeout option that could be configurable when compiling the regex so that if the regex didn't complete within the specified timeframe, it would abort and throw an exception.
The bug was closed as Won't Fix and it was suggested that I bring up my idea here. The suggestion made to me on the bug was to read Mastering Regular Expressions and get better at writing regexes. I will take this advice, but this isn't really a reasonable solution to my problem for a few reasons. My use case is log parsing and I have a large number of regexes that run over many different log lines. With the volume of regexes I have, it's hard to make sure every regex has no potential problems, especially when the pathological behavior only occurs on certain inputs that may not have been anticipated when developing the regex. Also because of the volume of data these regexes are parsing, I would never want to allow a regex to run longer than a few milliseconds because if it did, that would kill my log processing throughput. I'd rather that it just raise an exception and move on to the next log entry.

Thanks, J.B.
On Mon, Feb 14, 2022 at 9:55 AM J.B. Langston <jblangston@datastax.com> wrote:
... more generally I think it would be good to have a timeout option that could be configurable when compiling the regex so that if the regex didn't complete within the specified timeframe, it would abort and throw an exception.
... The suggestion made to me on the bug was to read Mastering Regular Expressions and get better at writing regexes. I will take this advice, but this isn't really a reasonable solution to my problem for a few reasons. My use case is log parsing and I have a large number of regexes that run over many different log lines. With the volume of regexes I have, it's hard to make sure every regex has no potential problems, especially when the pathological behavior only occurs on certain inputs that may not have been anticipated when developing the regex.
A regex that's vulnerable to pathological behavior is a DoS attack waiting to happen. Especially when used for parsing log data (which might contain untrusted data). If possible, we should make it harder for people to shoot themselves in the feet. As an aside, pure regexes are not vulnerable, only extended regexes. However, Python doesn't (that I know of) have a class that only accepts pure regexes. --- Bruce
A regex that's vulnerable to pathological behavior is a DoS attack waiting to happen. Especially when used for parsing log data (which might contain untrusted data). If possible, we should make it harder for people to shoot themselves in the feet.
While definitely not as bad and not as likely as SQL injection, I think the possibility of regex DoS is totally missing in the stdlib re docs. Should there be something added there about if you need to put user input into an expression, best practice is to re.escape it?
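For the simple case of embedding user-supplied text as a literal, the suggested practice would look something like this (a minimal sketch; the variable names are made up):

import re

user_input = "spam.eggs*"                       # untrusted text, treated as a literal
pattern = re.compile("name=" + re.escape(user_input))
print(pattern.pattern)                          # name=spam\.eggs\*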
On Mon, Feb 14, 2022 at 03:58:49PM -0600, Nick Timkovich wrote:
While definitely not as bad and not as likely as SQL injection, I think the possibility of regex DoS is totally missing in the stdlib re docs. Should there be something added there about if you need to put user input into an expression, best practice is to re.escape it?
That doesn't help you when you wish to allow the user to specify a regex as the search term. -- Steve
A regex that's vulnerable to pathological behavior is a DoS attack waiting
to happen. Especially when used for parsing log data (which might contain untrusted data). If possible, we should make it harder for people to shoot themselves in the feet.
And this is exactly what happened to me. I have a job that automatically parses logs as they are uploaded, and a log came in that had an unexpected pattern that triggered pathological behavior in my regex that did not occur when processing the expected input. This caused the import pipeline to back up for many hours before I noticed and fixed it.
While definitely not as bad and not as likely as SQL injection, I think the possibility of regex DoS is totally missing in the stdlib re docs. Should there be something added there about if you need to put user input into an expression, best practice is to re.escape it?
Unless I am missing something, I don't see how re.escape would have helped me here. I wasn't trying to treat arbitrary input as a regex, so escaping the regex characters in it wouldn't have done anything to help me. The problem is that a regex *that I wrote* had a bug in it that caused pathological behavior, but it wasn't found during testing because it only occurred when matching against an unexpected input.

--
J.B. Langston
Tech Support Tools Wrangler | DataStax

On Mon, Feb 14, 2022 at 3:59 PM Nick Timkovich <prometheus235@gmail.com> wrote:
A regex that's vulnerable to pathological behavior is a DoS attack waiting
to happen. Especially when used for parsing log data (which might contain untrusted data). If possible, we should make it harder for people to shoot themselves in the feet.
While definitely not as bad and not as likely as SQL injection, I think the possibility of regex DoS is totally missing in the stdlib re docs. Should there be something added there about if you need to put user input into an expression, best practice is to re.escape it?
How embarrassing... I apologize for all the signature garbage at the end of my message.
""" Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. - Jamie Zawinski """ Even more true of regexps than of floating point, and even of daemon threads ;-) regex is a terrific module, incorporating many features that newer regexp implementations support. But there's a downside: if anyone was inclined to work on Python's regexp engine, the existence of that more refined alternative is a huge disincentive. Why bother? regex already does it. I would like to see regex folded into the Python core, but so far its author has shown no sign of wanting to do so. BTW, regex is also harder to provoke into exponential-time bad cases. While learning to avoid bad cases to begin with requires climbing a learning curve, it's not all that hard. People typically get in trouble by slamming in \s* and \s+ all over the place, then get surprised in a non-matching-overall case when the engine needs exponential time to work through all possible ways of segmenting whitespace. Well, that's what you told it to do. Friedl's book spells out systematic ways to avoid that. regex also supports newer ideas, like "atomic groups", that can be used explicitly to limit backtracking. BTW, there's scant general use for a "linear time" regexp engine. If it's "pure", it can require exponential _space_ to build a compiled regexp, and won't support many features people have become used to. The most gonzo modern packages are indeed "gonzo", doing a mix of linear-time algorithms and on-the-fly switching to backtracking algorithms akin to Python's and Perl's. Just supporting capturing groups (which almost all people using regexps for "parsing" use) typically requires two passes under the covers: a linear-time pass to see whether there's a match overall, and, if that succeeds, a second pass with possible backtracking, using crumbs dropped by the first pass to cut off some futile alternatives. For those with long memories, regexps in practical Comp Sci were typically used for lexers, simple expressions to identify tokens which were in turn consumed by actual parsers. The Unix™ lex and yacc were very popular in their day, each using the right tool for its job. An interesting lesson nobody wants to learn: the original major string-processing language, SNOBOL, had powerful pattern matching but no regexps. Griswold's more modern successor language, Icon, found no reason to change that. Naive regexps are both clumsy and prone to bad timing in many tasks that "should be" very easy to express. For example, "now match up to the next occurrence of 'X'". In SNOBOL and Icon, that's trivial. 75% of regexp users will write ".*X", with scant understanding that it may match waaaay more than they intended. Another 20% will write ".*?X", with scant understanding that may extend beyond _just_ "the next" X in some cases. That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start. Read the Zawinski quote again ;-)
On Mon, Feb 14, 2022 at 05:13:38PM -0600, Tim Peters wrote:
An interesting lesson nobody wants to learn: the original major string-processing language, SNOBOL, had powerful pattern matching but no regexps. Griswold's more modern successor language, Icon, found no reason to change that.
I've been interested in the existence of SNOBOL string scanning for a long time, but I know very little about it. How does it differ from regexes, and why have programming languages pretty much standardised on regexes rather than other forms of string matching?
Naive regexps are both clumsy and prone to bad timing in many tasks that "should be" very easy to express. For example, "now match up to the next occurrence of 'X'". In SNOBOL and Icon, that's trivial. 75% of regexp users will write ".*X", with scant understanding that it may match waaaay more than they intended.
Indeed, I've been bitten by that many times :-)
Another 20% will write ".*?X", with scant understanding that may extend beyond _just_ "the next" X in some cases.
But this surprises me. Do you have an example?
That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start.
Doesn't that only work if X is literally a single character?
>>> import re
>>> string = "This is some spam and extra spam."
>>> re.search('[^spam]*spam', string)
<re.Match object; span=(11, 17), match='e spam'>
Whereas this seems to do what I expected:
>>> re.search('.*?spam', string)
<re.Match object; span=(0, 17), match='This is some spam'>
-- Steve
On Tue, 15 Feb 2022 at 11:47, Steven D'Aprano <steve@pearwood.info> wrote:
Another 20% will write ".*?X", with scant understanding that may extend beyond _just_ "the next" X in some cases.
But this surprises me. Do you have an example?
Nongreedy means it'll prefer the next X, but it has to be open to checking others.
re.search("a.*?X[^X]*?Y", "zzzabbbXcccXdddYzzz") <re.Match object; span=(3, 16), match='abbbXcccXdddY'>
The X between bbb and ccc won't result in a match, so the .*? has to capture more.
That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start.
Doesn't that only work if X is literally a single character?
Yes, but if X is actually "spam", then you can probably do other assertions to guarantee the right match. It gets pretty clunky though.
>>> import re
>>> string = "This is some spam and extra spam."
>>> re.search('[^spam]*spam', string)
<re.Match object; span=(11, 17), match='e spam'>
Whereas this seems to do what I expected:
>>> re.search('.*?spam', string)
<re.Match object; span=(0, 17), match='This is some spam'>
Yes, and that's fine as long as all you care about is whether it matches. For a simple example like this, it's fine. But this is far from efficient in more complex cases, since it could potentially have to check much deeper into the string.

I'm not familiar with SNOBOL, but one thing I am familiar with is C's scanf (or sscanf etc), which is a parallel to printf, the basis for Python's percent formatting. REXX has a PARSE command; different syntax, similar limitations. Either way, it's a zero-backtracking parser that can do a wide variety of left-to-right scanning, which is a useful limitation for a number of cases, and actually only gets in the way very rarely. Here's an example of how it might work in Python:

prefix, spam, eggs = sscanf("There's lots of spam and 12 eggs here, that's lots of eggs",
                            "%s lots of %s and %d eggs")

The rules are pretty simple: divide up the format into tokens - either a percent marker or a literal string like " lots of " - and match them in sequence. A "%s" can match any string, but only up to the next thing that matches the next token; so the initial "%s" cannot possibly match the "lots of" near the end of the string - the parser won't even consider it. Toss in a few variants like "%[a-z]" which can match any sequence of the characters a-z, "%4s" which must match precisely four characters, and "%*s" which matches without returning the value, and you can do a lot of parsing without ever worrying about exponential parsing time.

The REXX equivalent would be:

data = "There's lots of spam and 12 eggs here, that's lots of eggs"
parse var data prefix " lots of " spam " and " eggs " eggs"

but it's been a very very long time since I did any advanced REXX parsing, so I can't remember all the details of what it's capable of.

ChrisA
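For what it's worth, here is a rough sketch of how such an sscanf helper could be put together on top of re. The name and the %s/%d/%c subset are my own assumptions, and because it leans on lazy quantifiers it only approximates the strict zero-backtracking behaviour described above:

import re

def sscanf(text, fmt):
    """Tiny scanf-style matcher supporting %s, %d, %c and literal text."""
    pattern, casts = [], []
    for token in re.split(r"(%[sdc])", fmt):
        if token == "%d":
            pattern.append(r"([+-]?\d+)")
            casts.append(int)
        elif token == "%c":
            pattern.append(r"(.)")
            casts.append(str)
        elif token == "%s":
            pattern.append(r"(.+?)")      # lazy: stops at the next literal token
            casts.append(str)
        else:
            pattern.append(re.escape(token))
    m = re.match("".join(pattern), text)
    if m is None:
        raise ValueError("format does not match")
    return [cast(group) for cast, group in zip(casts, m.groups())]

prefix, spam, eggs = sscanf(
    "There's lots of spam and 12 eggs here, that's lots of eggs",
    "%s lots of %s and %d eggs")
print(prefix, spam, eggs)   # There's spam 12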
[Tim]
That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start.
[Steven]
Doesn't that only work if X is literally a single character?
Right. It was an example, not a meta-example. Even for a _single character_, "match up to the next, but never more or less than that" is a puzzle for most regexp users. [Chris]
Yes, but if X is actually "spam", then you can probably do other assertions to guarantee the right match. It gets pretty clunky though.
Assertions aren't needed, but it is nightmarish to get right.

(|[^s]|s(|[^p]|p(|[^a]|a(|[^m]))))*spam

The "spam" at the end is the only obvious part ;-) Before then, we match 0 or more instances of

nothing or not 's' or 's' followed by
    nothing or not 'p' or 'p' followed by
        nothing or not 'a' or 'a' followed by
            nothing or not 'm'

"spam" itself can't get through that maze, so backtracking into it after its first match can't consume the matched "spam" to find a later one.

In SNOBOL, as I recall, it could be spelled

ARB "spam" FENCE

Those are all pattern objects, and infix whitespace is a binary pattern catenation operator.

ARB is a builtin pattern that matches the empty string at first, and extends what it matches by one character each time it's backtracked into.

"spam" matches the obvious string.

Then FENCE is a builtin pattern that matches an empty string, but acts as a backtracking barrier: if the overall match attempt fails, backtracking will not move "to the left" of FENCE. So, here, ARB will not get a chance to consume more characters after the leftmost "spam" is found.
On Tue, 15 Feb 2022 at 13:57, Tim Peters <tim.peters@gmail.com> wrote:
In SNOBOL, as I recall, it could be spelled
ARB "spam" FENCE
Those are all pattern objects, and infix whitespace is a binary pattern catenation operator.
ARB is a builtin pattern that matches the empty string at first, and extends what it matches by one character each time it's backtracked into.
"spam" matches the obvious string.
Then FENCE is a builtin pattern that matches an empty string, but acts as a backtracking barrier: if the overall match attempt fails, backtracking will not move "to the left" of FENCE. So, here, ARB will not get a chance to consume more characters after the leftmost "spam" is found.
Ah, so that's a bit more complicated than the "no-backtracking" parsing style of REXX and scanf. Does this guarantee anything about performance, or is it primarily to allow you to write patterns that can't mismatch strings? ChrisA
[Tim]
In SNOBOL, as I recall, it could be spelled
ARB "spam" FENCE
[Chris]
Ah, so that's a bit more complicated than the "no-backtracking" parsing style of REXX and scanf.
Oh, a lot more complex. In SNOBOL, arbitrary computation can be performed at any point in pattern matching. Yet _simple_ things are simple to say.

Going back to the original one-letter example, "find the leftmost 'X' after the current point" is just

BREAK('X') 'X'

BREAK is a builtin function that returns a pattern that consumes all the characters until the first occurrence of one of the letters in its argument. If it's backed up into via backtracking, it fails: the shortest string is the only one it will consume.

If you want something fancier, fine - you can code anything you like. For example, here's a pattern object (BAL) that matches the longest non-empty substring balanced with respect to parentheses:

BALEXP = NOTANY('()') '(' ARBNO( *BALEXP) ')'
BAL = BALEXP ARBNO(BALEXP)

ARBNO is akin to a regexp .*? quantifier. The '*' in front of BALEXP in the first line tells SNOBOL to delay evaluating BALEXP until the pattern runs; at the time BALEXP is _defined_ by that line, BALEXP isn't yet bound to anything useful.
Does this guarantee anything about performance, or is it primarily to allow you to write patterns that can't mismatch strings?
They were inventing "pattern matching" in the 1960s, and were looking for expressive notations that were easy for experts to use. Regular expressions were much more of theoretical interest at the time, and offered far feebler abilities than SNOBOL's pattern abilities anyway. It's still interesting to me that they settled on a relatively "wordy" notation, which is pretty easy to pick up. This is perhaps because regexps are so feeble in comparison that they allowed proving mountains of properties, so a hyper-concise notation for them was invented to make publishing long-winded proofs possible ;-) But while SNOBOL was wildly inventive for its time, it was always intended to be used "to get real work done".

Note too that potential backtracking isn't hiding in notation: it's quite explicit in names like ARB and ARBNO, and, of course, in pattern alternation (the infix binary "|" operator). Most builtin pattern functions and objects in SNOBOL are "one and done" (they match or they don't, and fail if backtracked into). Want a string of digits? SPAN('0123456789'). Like a regexp's \d+, but will _only_ match the longest string of digits starting at the current point.

The possibility for backtracking at the _lexical_ level is almost always a misfeature. Which didn't matter in the seminal lex+yacc scheme. yacc consumed the tokens lex's regexps produced, and if yacc failed to match the grammar lex didn't go on to try a different way of tokenizing. In effect, \d+ in lex acted like SNOBOL's SPAN. regexps didn't manage the transition from "a tool" to "a solution" gracefully ;-)
[Tim, on trying to match only the next instance of "spam"]
Assertions aren't needed, but it is nightmarish to get right.
Followed by a nightmare that got it wrong. My apologies - that's what I get for trying to show off ;-)

It's actually far easier if assertions are used, and I'm too old to bother trying to repair the non-assertion mess:

([^s]|s(?!pam))*spam

Bingo. That pattern is easy enough to understand (if not to invent the first time): we can chew up a character if it's not an "s", or if it is an "s" but one _not_ followed immediately by "pam". The "s" at the start of a matched "spam" doesn't satisfy either alternative, so it can't go beyond the leftmost following "spam".

Of course the same trick can be used to find just the next occurrence of any fixed string of at least two characters.
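A quick check with the sample string used earlier in the thread (output as I'd expect it; shown only to tie the two sub-threads together):

>>> import re
>>> re.search(r"([^s]|s(?!pam))*spam", "This is some spam and extra spam.")
<re.Match object; span=(0, 17), match='This is some spam'>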
On Tue, Feb 15, 2022 at 05:39:33AM -0600, Tim Peters wrote:
([^s]|s(?!pam))*spam
Bingo. That pattern is easy enough to understand
You and I have very different definitions of the word "easy" :-)
(if not to invent the first time): we can chew up a character if it's not an "s", or if it is an "s" but one _not_ followed immediately by "pam".
It is times like this that I am reminded why I prefer to just call string.find("spam") :-) -- Steve
On Wed, 16 Feb 2022 at 00:55, Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Feb 15, 2022 at 05:39:33AM -0600, Tim Peters wrote:
([^s]|s(?!pam))*spam
Bingo. That pattern is easy enough to understand
You and I have very different definitions of the word "easy" :-)
(if not to invent the first time): we can chew up a character if it's not an "s", or if it is an "s" but one _not_ followed immediately by "pam".
It is times like this that I am reminded why I prefer to just call string.find("spam") :-)
Yeah, regexes always look terrible when they're used for simple examples :) But try matching a line that has (somewhere in it) the word "spam", then whitespace, then a number (or if you prefer: then a sequence of ASCII digits). It's easy to write "spam\s+[0-9]+" and not nearly as easy to write it with method calls. So it makes sense that, when you add a restriction like "the word spam must be the first instance of that in the line" (maybe not common with words, but it certainly would be if you're scanning for a colon or other separator), it should still be written that way. To be honest, I don't think I've ever used method calls for complicated parsing. It's just way too messy. Much easier to reach for a regex, sscanf pattern, or other tool - even if it's not technically perfect. ChrisA
On Wed, Feb 16, 2022 at 01:02:44AM +1100, Chris Angelico wrote:
Yeah, regexes always look terrible when they're used for simple examples :) But try matching a line that has (somewhere in it) the word "spam", then whitespace, then a number (or if you prefer: then a sequence of ASCII digits). It's easy to write "spam\s+[0-9]+"
After this thread, I no longer trust that "easy" regexes will do what they "obviously" look like they should do :-( I'm not trying to be funny or snarky. I *thought* I had a reasonable understanding of regexes, and now I have learned that I don't, and that the regexes I've been writing don't do what I thought they did, and presumedly the only reason they haven't blown up in my face (either performance-wise, or the wrong output) is blind luck. Now I have *three* problems :-( -- Steve
On Wed, 16 Feb 2022 at 09:28, Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Feb 16, 2022 at 01:02:44AM +1100, Chris Angelico wrote:
Yeah, regexes always look terrible when they're used for simple examples :) But try matching a line that has (somewhere in it) the word "spam", then whitespace, then a number (or if you prefer: then a sequence of ASCII digits). It's easy to write "spam\s+[0-9]+"
After this thread, I no longer trust that "easy" regexes will do what they "obviously" look like they should do :-(
I'm not trying to be funny or snarky.
(That must be rare!)
I *thought* I had a reasonable understanding of regexes, and now I have learned that I don't, and that the regexes I've been writing don't do what I thought they did, and presumedly the only reason they haven't blown up in my face (either performance-wise, or the wrong output) is blind luck.
Now I have *three* problems :-(
I think it's one of those cases where it normally doesn't matter that they don't technically do quite what you thought. Pretending that a regex matches in a simpler way than it actually does is like pretending that the earth is a sphere: technically wrong, but almost always close enough. It's only in the rare cases that it matters, and they usually only show up with the regexps that are so complicated that I wouldn't trust them to not be buggy anyway. (Debugging a regexp is a PAIN, when your main response is just "nope didn't match".) ChrisA
[Steven D'Aprano <steve@pearwood.info>]
After this thread, I no longer trust that "easy" regexes will do what they "obviously" look like they should do :-(
I'm not trying to be funny or snarky. I *thought* I had a reasonable understanding of regexes, and now I have learned that I don't, and that the regexes I've been writing don't do what I thought they did, and presumedly the only reason they haven't blown up in my face (either performance-wise, or the wrong output) is blind luck.
Reading Friedl's book is a cure for the confusion, but not for the angst ;-)

I believe the single most practical addition in recent decades has been the introduction of "possessive quantifiers". This is a variant of the "greedy" quantifiers that does what most people at the start _believe_ they do: one-and-done. After its initial match, backtracking into it fails. So, e.g., \s++ matches the longest string of whitespace at the time, period.

Why "++"? Regexps ;-) It's essentially gibberish syntax that previously didn't have a sensible meaning. For example,
regex.search("^x+[a-z]{4}k", "xxxxxk") <regex.Match object; span=(0, 6), match='xxxxxk'>
is what we're used to if we're paying attention: sucking up as many x's as possible fails to match (there's nothing for [a-z]{4} to match except the trailing "k"). But we keep backtracking into it, trying to match one less "x" at a time, until [a-z]{4} finally matches the rightmost 4 x's. But make it possessive and the match as a whole fails right away:
regex.search("^x++[a-z]{4}k", "xxxxxk")
++ refuses to give back any of what it matched the first time. At this point there are probably more regexp engines that support this feature than don't. Python's re does not, but the regex extension does. Cutting unwanted chances for backtracking greatly cuts the chance of stumbling into timing disasters.

Where does that leave Python? Pretty much aging itself into obsolescence. Regexps keep "evolving", it appears Fredrik lost interest in keeping up long before he died, and nobody else has stepped up. regex _has_ kept up, but isn't in the core. So "install regex" is ever more the best advice.

Note that just slamming possessive quantifiers into CPython's engine isn't a good approach for more than just the obvious reasons: possessive quantifiers are themselves just syntax sugar (or chili peppers) for one instance of a more general new feature, "atomic groups". Another that's all but a de facto industry standard now, which Python's re doesn't support (but regex does). Putting just part of that in is half-assed.
Now I have *three* problems :-(
You're quite welcome ;-)
On Wed, 16 Feb 2022 at 12:56, Tim Peters <tim.peters@gmail.com> wrote:
Regexps keep "evolving"...
Once upon a time, a "regular expression" was a regular grammar. That is no longer the case. Once upon a time, a regular expression could be broadly compatible with multiple different parser engines. That is being constantly eroded. So far, I think they still count as expressions. That's about all we can depend on. Is there any sort of standardization of regexp syntax and semantics, or does everyone just extend it in their own directions, borrowing ideas from each other to give some not-always-false assurance of compatibility? ChrisA
On 2022-02-16 02:11, Chris Angelico wrote:
On Wed, 16 Feb 2022 at 12:56, Tim Peters <tim.peters@gmail.com> wrote:
Regexps keep "evolving"...
Once upon a time, a "regular expression" was a regular grammar. That is no longer the case.
Once upon a time, a regular expression could be broadly compatible with multiple different parser engines. That is being constantly eroded.
So far, I think they still count as expressions. That's about all we can depend on.
Is there any sort of standardization of regexp syntax and semantics, or does everyone just extend it in their own directions, borrowing ideas from each other to give some not-always-false assurance of compatibility?
The only regex standard I know of is the POSIX standard, but I don't know of a common implementation that follows it. Most tend to follow Perl, although Perl borrowed named groups from Python, though with a slightly different syntax ("(?<name>...)" instead of "(?P<name>...)").
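For reference, Python's spelling of a named group looks like this (just a quick illustrative snippet):

>>> import re
>>> re.match(r"(?P<word>\w+)", "spam and eggs").group("word")
'spam'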
On Tue, Feb 15, 2022 at 6:13 PM Chris Angelico <rosuav@gmail.com> wrote:
Once upon a time, a "regular expression" was a regular grammar. That is no longer the case.
I use "regex" for the weird backtracking minilanguages and deliberately never call them "regular expressions". (I was under the impression that the Perl documentation observed the same convention but that doesn't seem to be true.) Once upon a time, a regular expression could be broadly compatible with
multiple different parser engines.
I think there never was such a time, at least not if you wanted syntactic compatibility. Is there any sort of standardization of regexp syntax and semantics[...]?
I'm not sure there needs to be. There is no standardization of programming-language syntax in general, unless you count conventions like {...} for blocks which Python ignores. The problem as I see it isn't that the syntax isn't standardized. It's that the syntax, to the extent it is standardized, is terrible. The only traditional Unix tool whose regular expression syntax isn't godawful is lex, and unfortunately that isn't the one that caught on.
[Chris Angelico <rosuav@gmail.com>]
Is there any sort of standardization of regexp syntax and semantics,
Sure. "The nice thing about standards is that you have so many to choose from" ;-) For example, POSIX defines a regexp flavor so it can specify what things like grep do. The ECMAScruot standard defines its own standard, ditto Java, etc.
or does everyone just extend it in their own directions, borrowing ideas from each other to give some not-always-false assurance of compatibility?
In real life, everyone strives to copy what Perl does, because regexps are ubiquitous in Perl and Larry Wall worked hard at the time to put in every useful regexp feature everyone else already had, but with more uniform syntax. Perl's love of regexps strikes me as clinically pathological, but that doesn't diminish my respect for the relative sanity Perl brought to this area.

There's a little bit flowing _into_ Perl too. An example close to my heart: Guido and I obtained Larry's promise that he'd never use a (?P prefix, so that Python could use that for its own purposes. Which amounted to introducing the concept of named groups. Which Perl in turn later adopted - although Perl dropped the "P" for named groups.

Ah - I see MRAB replied while I was typing this, saying much the same. But I'm wordier, so I won't waste the effort ;-)
I know this is probably too much self promotion, but I really enjoyed writing this less than a year ago: https://gnosis.cx/regex/ (The Puzzling Quirks of Regular Expressions). It's like other puzzle books, but for programmers. You should certainly still get Friedl's book if you don't have it. (You can read mine online in a couple versions. But the printed one is nice looking, and the artwork shows better.)

In particular, I'd love to send signed copies to any of the folks with familiar names here. Maybe pay me for shipping, but you don't have to (within the US, media mail is cheap, elsewhere in the world is expensive, unfortunately). Or buy from Lulu if you don't care about the autograph... Or read for free, of course.

I get a little bit into the obscure theory of this thread. But mostly I think you'll just laugh, and experience occasional confusion.

On Tue, Feb 15, 2022, 9:46 PM Tim Peters <tim.peters@gmail.com> wrote:
[Chris Angelico <rosuav@gmail.com>]
Is there any sort of standardization of regexp syntax and semantics,
Sure. "The nice thing about standards is that you have so many to choose from" ;-) For example, POSIX defines a regexp flavor so it can specify what things like grep do. The ECMAScruot standard defines its own standard, ditto Java, etc.
or does everyone just extend it in their own directions, borrowing ideas from each other to give some not-always-false assurance of compatibility?
In real life, everyone strives to copy what Perl does, because regexps are ubiquitous in Perl and Larry Wall worked hard at the time to put in every useful regexp feature everyone else already had, but with more uniform syntax. Perl's love of regexps strikes me as clinically pathological, but that doesn't diminish my respect for the relative sanity Perl brought to this area.
There's a little bit flowing _into_ Perl too. An example close to my heart: Guido and I obtained Larry's promise that he'd never use a (?P prefix, so that Python could use that for its own purposes. Which amounted to introducing the concept of named groups. Which Perl in turn later adopted - although Perl dropped the "P" for named groups.
Ah - I see MRAB replied while I was typing this, saying much the same. But I'm wordier, so I won't waste the effort ;-)
[Tim, on trying to match only the next instance of "spam"]
... It's actually far easier if assertions are used, and I'm too old to bother trying to repair the non-assertion mess:
([^s]|s(?!pam))*spam
Since then, Serhiy rehabilitated an old patch to add "atomic groups" and "possessive quantifiers" to CPython's `re`, and merged it today. So in 3.11, you'll be able to do the far easier:

(?>.*?spam)

instead. (?>...) delimits an "atomic group", and `...` can be any regexp. The group is non-capturing. `...` is matched in the normal way, but with a twist: _after_ it matches (if it ever does), it's done. The first match it finds is also the last match it will try. If, after succeeding, the overall match fails to the right of the atomic group, backing up into the group fails at once - no other alternatives for `...` are tried.

So, in the example, .*?spam finds the closest (if any) following instance of "spam", and that's all. It won't ever go on to try to match a later instance of "spam". Which is probably what most people had in mind all along when they wrote plain .*?spam but were instead delighted by mysterious cases of exponential-time backtracking ;-)

"Possessive quantifiers" are syntactic sugar for particular simple instances of atomic groups. Where R is a regexp,

R++ is short for (?>R+)
R*+ is short for (?>R*)
R{m,n}+ is short for (?>R{m,n})
R?+ is short for (?>R?)

In all cases, they take the longest match possible for R starting right here, right now, without any consideration for whether that may or may not cause the rest of the pattern to fail, and then they hold on to that longest-possible match no matter what. No backtracking after the first success.

You should feel free to use these in 3.11. There are few modern regexp engines that don't support them, and they were the most conspicuous of the "missing features" in Python's `re`.
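A quick illustration on 3.11 or later, using the sample string from earlier in the thread (a sketch - the reprs are what I'd expect from the description above, not pasted from a session):

>>> import re   # Python 3.11+, where re supports atomic groups
>>> s = "This is some spam and extra spam."
>>> re.match(r".*?spam[.]", s)           # lazy .*? still crawls on to the second "spam"
<re.Match object; span=(0, 33), match='This is some spam and extra spam.'>
>>> re.match(r"(?>.*?spam)[.]", s)       # atomic group locks in the first "spam": no match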
[Steven D'Aprano <steve@pearwood.info>]
I've been interested in the existence of SNOBOL string scanning for a long time, but I know very little about it.
How does it differ from regexes, and why have programming languages pretty much standardised on regexes rather than other forms of string matching?
What we call "regexps" today contain all sorts of things that aren't in the original formal definition of "regular expressions". For example, even the ubiquitous "^" and "$" (start- and end-of-line assertions) go beyond what the phrase formally means. So the question is ill-defined. When Perl added recursive regular expressions, I'm not sure there's any real difference in theoretical capability remaining. Without that, though, and for example, you can't write a regular expression that matches strings with balanced parentheses ("regexps can't count"), while I earlier posted a simple 2-liner in SNOBOL that implements such a thing (patterns in SNOBOL can freely invoke other patterns, including themselves). As to why regexps prevailed, traction! They are useful tools, and _started_ life as pretty simple things, with small, elegant, and efficient implementations Feature creep and "faster! faster! faster!" turned the implementations more into bottomless pits now ;-) Adoption breeds more adoption in the computer world. They have no real competition anymore. The same sociological illness has also cursed us, e.g., with an eternity of floating point signed zeroes ;--) Chris didn't say this, but I will: I'm amazed that things much _simpler_ than regexps, like his scanf and REXX PARSE examples,,haven't spread more. Simple solutions to simple problems are very appealing to me. Although, to be fair, I get a kick too out of massive overkill ;l-)
Tim Peters writes:
Chris didn't say this, but I will: I'm amazed that things much _simpler_ than regexps, like his scanf and REXX PARSE examples, haven't spread more.
scanf just isn't powerful enough. For example, consider parsing user input dates: scanf("%d/%d/%d", &year, &month, &day). This is nice and simple, but handling "2022-02-15" as well requires a bit of thinking and several extra statements in C. In Python, I guess it would probably look something like

year, sep1, month, sep2, day = scanf("%d%c%d%c%d")
if not ('/' == sep1 == sep2 or '-' == sep1 == sep2):
    raise DateFormatUnacceptableError
# range checks for month and day go here

which isn't too bad, though. But

year, month, day = re.match(r"(\d+)[-/](\d+)[-/](\d+)").groups()
if not sep1 == sep2:
    raise DateFormatUnacceptableError
# range checks for month and day go here

expresses the intent a lot more clearly, I think. Sure, it's easy to write uninterpretable regexps, but up to that point regexps are very expressive. And that example can be reduced to one line (plus the comment) at the expense of a less symmetric, slightly less readable expression like r"(\d+)([-/])(\d+)\2(\d+)". Some folks might like that one better.
Simple solutions to simple problems are very appealing to me.
The Zawinski quote is motivated by the perception that people seem to think that simplicity lies in minimizing the number of tools you need to learn. REXX and SNOBOL pattern matching are quite a bit more specialized to particular tools than regexps. That is, all regexp implementations support the same basic language which is sufficient for most tasks most programmers want regexps for. I think you'd need to implement such a facility in a very popular scripting language such as sh, Perl, or Python for it to have the success of regexps.
Although, to be fair, I get a kick too out of massive overkill ;-)
Don't we all, though? Steve
On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
The Zawinski quote is motivated by the perception that people seem to think that simplicity lies in minimizing the number of tools you need to learn. REXX and SNOBOL pattern matching are quite a bit more specialized to particular tools than regexps. That is, all regexp implementations support the same basic language which is sufficient for most tasks most programmers want regexps for.
The problem is that that's an illusion. If you restrict yourself to the subset that's supported by every regexp implementation, you'll quickly find tasks that you can't handle. If you restrict yourself to what you THINK is the universal subset, you end up with something that has a subtle difference when you use it somewhere else (I've had this problem with grep and Python, where a metacharacter in one was a plain character in the other - also frequently a problem between grep and sed, with the consequent "what do I need to escape?" problem). But as the OP has found, regexps are a hammer that, for some nail-like problems, will whack an arbitrary number of times before hitting. So I guess the question isn't "why are regular expressions so popular" but "why are other things not ALSO popular". I honestly think that scanf parsing, if implemented ad-hoc by different programming languages and extended to their needs, would end up no less different from each other than different regexp engines are - the most-used parts would also be the most-compatible, just like with regexps. ChrisA
Chris Angelico writes:
On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
That is, all regexp implementations support the same basic language which is sufficient for most tasks most programmers want regexps for.
The problem is that that's an illusion.
It isn't for me. I write a lot of regexps for several languages (Python, grep, sed, Emacs Lisp), I rarely have to debug one, and in a year there may be one where debugging requires more than reading each character out loud and recognizing a typo. As a sometime Emacsen dev, I also do a fair amount of debugging of other people's regexps. Yuck! But it's almost always the case that (modulo efficiency considerations) it's pretty easy to figure out what they *want*, and rewrite the code (*not* the *regexp(s)*!) to use simpler regexps (usually parts of the original) in a more controlled way.
If you restrict yourself to the subset that's supported by every regexp implementation, you'll quickly find tasks that you can't handle.
That's true of everything in programming, there are no tools that can handle everything until you have a Turing-complete programming language, and even then, practically there are always things that are too painful even for masochists to do in that language. But with regexps, I don't, you see. Besides regexps, I write a lot of (trivial to simple) parsers. For the regexps, I don't need much more than ()[]+*?.\s\d\t\r\n most of the time (and those last 3 are due to tab-separated value files and RFC 822, special cases). I could probably use scanf (except that Python, sed, and Emacs Lisp don't have it ;-) but for the lack of []. Occasionally for things like date parsing and other absolutely fixed-field contexts I'll use {}. I do sanity-checking on the result frequently. If the regexp engine supports it, I'll use named groups and other such syntactic sugar. In a general purpose programming language, if it supports "literate regexps", I use those, if not, I use separate strings (which also makes it easy to back out of a large regexp into statements if I need to establish backtracking boundaries). Sure, if you want to do all of that *in* a single regexp, you are indeed going to run into things you can't do that way. When I wrote, "what people want to do" I meant tasks where regexps could do a lot of the task, but not that they could do the whole thing in one regexp. For my style, regexps are something that's available in a very wide array of contexts, and consistent enough to get the job done. I treat complex regexps the way I treat C extensions: only if the performance benefit is human-perceptible, which is pretty rare.
"why are other things not ALSO popular". I honestly think that scanf parsing, if implemented ad-hoc by different programming languages and extended to their needs, would end up
... becoming regexp-like, and not just in the sense of
no less different from each other than different regexp engines are - the most-used parts would also be the most-compatible, just like with regexps.
;-)

What I think is more interesting than simpler (but more robust for what they can do) facilities is better parser support in standard libraries (not just Python's), and more use of them in place of hand-written "parsers" that just eat tokens defined by regexps in order. If one could, for example, write

[ "Sun|Mon|Tue|Wed|Thu|Fri|Sat" : dow,
  ", ".
  "(?: |\d)\d)" : day,
  " ",
  "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec" : month,
  " ",
  "\d\d\d\d" : year,
  " ",
  "\d\d:\d\d:\d\d" : time,
  " ",
  "[+-]\d\d\d\d" : tzoffset ]

(which is not legal Python syntax but I'm too lazy to try to come up with something better) to parse an RFC 822 date, I think people would use that. Sure, for something *that* regular, most people would probably use the evident "literate" regexp with named groups, but it wouldn't take much complexity to make such a parser generator worthwhile to programmers.

Steve
On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
Chris Angelico writes:
On Wed, 16 Feb 2022 at 01:54, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
That is, all regexp implementations support the same basic language which is sufficient for most tasks most programmers want regexps for.
The problem is that that's an illusion.
It isn't for me. I write a lot of regexps for several languages (Python, grep, sed, Emacs Lisp), I rarely have to debug one, and in a year there may be one that debugging requires more than reading each character out loud and recognizing a typo.
I've used simple regexps in sed and grep, and found differences about what needs to be escaped, so even when you don't use the advanced features, you need to be aware of them.
But with regexps, I don't, you see. Besides regexps, I write a lot of (trivial to simple) parsers. For the regexps, I don't need much more than ()[]+*?.\s\d\t\r\n most of the time (and those last 3 are due to tab-separated value files and RFC 822, special cases). I could probably use scanf (except that Python, sed, and Emacs Lisp don't have it ;-) but for the lack of []. Occasionally for things like date parsing and other absolutely fixed-field contexts I'll use {}. I do sanity-checking on the result frequently.
Not sure what you mean by "lack of []", but some scanf variants do support that - for instance, %[a-z] will only match lowercase alpha.
If the regexp engine supports it, I'll use named groups and other such syntactic sugar. In a general purpose programming language, if it supports "literate regexps", I use those, if not, I use separate strings (which also makes it easy to back out of a large regexp into statements if I need to establish backtracking boundaries).
That's what I mean about the illusion. You can't use named groups in all regexp engines.
"why are other things not ALSO popular". I honestly think that scanf parsing, if implemented ad-hoc by different programming languages and extended to their needs, would end up
... becoming regexp-like, and not just in the sense of
no less different from each other than different regexp engines are - the most-used parts would also be the most-compatible, just like with regexps.
;-)
Heh, probably true :)
What I think is more interesting than simpler (but more robust for what they can do) facilities is better parser support in standard libraries (not just Python's), and more use of them in place of hand-written "parsers" that just eat tokens defined by regexps in order. If one could, for example, write
[ "Sun|Mon|Tue|Wed|Thu|Fri|Sat" : dow, ", ". "(?: |\d)\d)" : day, " ", "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec" : month, " ", "\d\d\d\d" : year, " ", "\d\d:\d\d:\d\d" : time, " ", "[+-]\d\d\d\d" : tzoffset ]
(which is not legal Python syntax but I'm too lazy to try to come up with something better) to parse an RFC 822 date, I think people would use that. Sure, for something *that* regular, most people would probably use the evident "literate" regexp with named groups, but it wouldn't take much complexity to make such a parser generator worthwhile to programmers.
That's an interesting concept. I can imagine writing it declaratively like this:

class Date(parser):
    dow: "Sun|Mon|Tue|Wed|Thu|Fri|Sat"
    _: ", "
    day: "(?: |\d)\d)"
    _: " "
    month: "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
    _: " "
    year: "\d\d\d\d"
    _: " "
    time: "\d\d:\d\d:\d\d"
    _: " "
    tzoffset: "[+-]\d\d\d\d"

Would it be better than a plain regex? Not sure.

ChrisA
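Purely as a sketch of how the declarative version could be wired up (my own guess at an implementation, not an existing library; note that repeated "_" fields would collide in __annotations__, so the separators below get distinct names):

import re

class parser:
    """Build one regex from the class annotations (hypothetical sketch)."""
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        parts = []
        for name, pat in cls.__annotations__.items():
            if name.startswith("_"):
                parts.append("(?:%s)" % pat)                # separator, not captured
            else:
                parts.append("(?P<%s>%s)" % (name, pat))    # named field
        cls._regex = re.compile("".join(parts))

    @classmethod
    def parse(cls, text):
        m = cls._regex.match(text)
        if m is None:
            raise ValueError("no match")
        return m.groupdict()

class Date(parser):
    dow: r"Sun|Mon|Tue|Wed|Thu|Fri|Sat"
    _1: r", "
    day: r"(?: |\d)\d"
    _2: r" "
    month: r"Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

print(Date.parse("Wed, 16 Feb"))   # {'dow': 'Wed', 'day': '16', 'month': 'Feb'}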
On Wed, 16 Feb 2022 at 10:23, Chris Angelico <rosuav@gmail.com> wrote:
On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
What I think is more interesting than simpler (but more robust for what they can do) facilities is better parser support in standard libraries (not just Python's), and more use of them in place of hand-written "parsers" that just eat tokens defined by regexps in order. If one could, for example, write
[ "Sun|Mon|Tue|Wed|Thu|Fri|Sat" : dow, ", ". "(?: |\d)\d)" : day, " ", "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec" : month, " ", "\d\d\d\d" : year, " ", "\d\d:\d\d:\d\d" : time, " ", "[+-]\d\d\d\d" : tzoffset ]
(which is not legal Python syntax but I'm too lazy to try to come up with something better) to parse an RFC 822 date, I think people would use that. Sure, for something *that* regular, most people would probably use the evident "literate" regexp with named groups, but it wouldn't take much complexity to make such a parser generator worthwhile to programmers.
That's an interesting concept. I can imagine writing it declaratively like this:
class Date(parser):
    dow: "Sun|Mon|Tue|Wed|Thu|Fri|Sat"
    _: ", "
    day: "(?: |\d)\d)"
I find it mildly amusing that even this "better" solution fell victim to an incorrect regexp ;-)

However, I do like the idea of having a better parser library in the stdlib. But it's pretty easy to write such a thing and publish it on PyPI, so the lack of an obvious "best in class" answer for this problem suggests that people would be less likely to use such a feature than we're assuming.

The two obvious examples on PyPI are:

1. PyParsing - https://pypi.org/project/pyparsing/. To me, this has the feel of the sort of functional approach SNOBOL used.
2. parse - https://pypi.org/project/parse/. A scanf-style approach inspired by format rather than printf.

Do people choose regexes over these because re is in the stdlib? Are they simply less well known? Or is there an attraction to regexes that makes people prefer them in spite of the complexity/maintainability issues?

Paul
On Wed, Feb 16, 2022, 5:46 AM Paul Moore <p.f.moore@gmail.com> wrote:
On Wed, 16 Feb 2022 at 10:23, Chris Angelico <rosuav@gmail.com> wrote:
On Wed, 16 Feb 2022 at 21:01, Stephen J. Turnbull <stephenjturnbull@gmail.com> wrote:
What I think is more interesting than simpler (but more robust for what they can do) facilities is better parser support in standard libraries (not just Python's), and more use of them in place of hand-written "parsers" that just eat tokens defined by regexps in order. If one could, for example, write
[ "Sun|Mon|Tue|Wed|Thu|Fri|Sat" : dow, ", ". "(?: |\d)\d)" : day, " ", "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec" : month, " ", "\d\d\d\d" : year, " ", "\d\d:\d\d:\d\d" : time, " ", "[+-]\d\d\d\d" : tzoffset ]
(which is not legal Python syntax but I'm too lazy to try to come up with something better) to parse an RFC 822 date, I think people would use that. Sure, for something *that* regular, most people would probably use the evident "literate" regexp with named groups, but it wouldn't take much complexity to make such a parser generator worthwhile to programmers.
That's an interesting concept. I can imagine writing it declaratively like this:
class Date(parser):
    dow: "Sun|Mon|Tue|Wed|Thu|Fri|Sat"
    _: ", "
    day: "(?: |\d)\d)"
I find it mildly amusing that even this "better" solution fell victim to an incorrect regexp ;-)
However, I do like the idea of having a better parser library in the stdlib. But it's pretty easy to write such a thing and publish it on PyPI, so the lack of an obvious "best in class" answer for this problem suggests that people would be less likely to use such a feature than we're assuming.
The two obvious examples on PyPI are:
1. PyParsing - https://pypi.org/project/pyparsing/. To me, this has the feel of the sort of functional approach SNOBOL used. 2. parse - https://pypi.org/project/parse/. A scanf-style approach inspired by format rather than printf.
Do people choose regexes over these because re is in the stdlib? Are they simply less well known? Or is there an attraction to regexes that makes people prefer them in spite of the complexity/maintainability issues?
Paul
Long story below but TLDR: I tried to use parse for a task I worked on for a long time, and eventually had to learn regex. After using regex somewhat regularly for a while now I concluded the power and ubiquity of it is worth the additional cognitive load (parse only requires one to be familiar with standard python string format syntax).

Story: The first task I set about trying to do in python (with no practical programming experience except for a single semester of c++ as part of a civil engineering curriculum) was a tool to convert 2D finite element mesh files into a file format for a niche finite element analysis program (the program is called CANDE; it's for analysis of buried culverts and pipes). My predecessor was creating these meshes by hand. He would literally get a 24"x36" sheet of drafting paper and draw out his mesh and number the nodes and elements and enter the data into the text file. It took me eons to write something (probably 6 years!), I probably started over from scratch at least 7, maybe 10 times. And even after all that, while I finally did arrive at something usable for myself, I never achieved my goal of being able to package something up I can pass on to my other colleague. When a new mesh has to be created they just ask me to do it (I still do them occasionally).

Anyway all that to say: I remember trying to avoid learning regex for about 4 years of this. It looked too scary. One day I finally ran into a task that parse: https://pypi.org/project/parse/ ...which I was relying heavily on, couldn't handle. I researched it and am pretty confident that even in my relative ignorance I am/was correct about it not being able to do it (I am racking my brain but can't remember what that need was). This prompted me to FINALLY do a few regex tutorials and watch some pycon videos and I came out the other end realizing that, hey, regex isn't so bad. And it's so darn powerful that the trade-off between it being an uphill task to learn and read (at least at first) and what you are able to do with it seems worth it.
Paul Moore writes:
However, I do like the idea of having a better parser library in the stdlib. But it's pretty easy to write such a thing and publish it on PyPI,
It's not easy to write a good one. I've tried in two languages (Python and Lisp). I'm not saying I'm competent, so that's a faint damn.
so the lack of an obvious "best in class" answer for this problem suggests that people would be less likely to use such a feature than we're assuming.
To me it suggests that it's hard to come up with a UI that's clearly better than regexp.
Do people choose regexes over these because re is in the stdlib?
Almost certainly, to some extent.
Or is there an attraction to regexes that makes people prefer them in spite of the complexity/maintainability issues?
I think that a lot of it is that people are extremely comfortable with globbing, and re at first glance looks like globbing on steroids with no downside. On the other hand, parsing is "known" (note scare quotes) to be hard, people *hate* "grammar", and unlike grep and sed which are simple enough to be usable at the command line, not only isn't lex embedded in yacc, but both really need to be saved in files and compiled, then run, to be tested. It just feels really heavy compared to regexp.

I think that if experienced people used parsers more they would get general uptake. But people seem to think that parsers are for experts, more so than regexp. That's why I chose a simple list-with-annotations rather than something like Chris's idea which in Python looks like magic. "If we could just make it *this* simple ..."

Steve
On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote:
scanf just isn't powerful enough. For example, consider parsing user input dates: scanf("%d/%d/%d", &year, &month, &day). This is nice and simple, but handling "2022-02-15" as well requires a bit of thinking and several extra statements in C. In Python, I guess it would probably look something like
    year, sep1, month, sep2, day = scanf("%d%c%d%c%d")
    if not ('/' == sep1 == sep2 or '-' == sep1 == sep2):
        raise DateFormatUnacceptableError
    # range checks for month and day go here
Assuming that scanf raises if there is no match, I would probably go with:

    try:
        # Who writes ISO-8601 dates using slashes?
        day, month, year = scanf("%d/%d/%d")
        if ALLOW_TWO_DIGIT_YEARS and len(year) == 2:
            year = "20" + year
    except ScanError:
        year, month, day = scanf("%d-%d-%d")
which isn't too bad, though. But
    year, month, day = re.match(r"(\d+)[-/](\d+)[-/](\d+)").groups()
    if not sep1 == sep2:
        raise DateFormatUnacceptableError
    # range checks for month and day go here
Doesn't that raise an exception?

    NameError: name 'sep1' is not defined

I think that

    year, sep1, month, sep2, day = re.match(r"(\d+)([-/])(\d+)([-/])(\d+)").groups()

might do it (until Tim or Chris tell me that actually is wrong). Or use \2 as you suggest later on.
expresses the intent a lot more clearly, I think.
Noooo, I don't think it does. The scanf (hypothetical) solution is a lot closer to my intent. But yes, regexes are more powerful: you can implement scanf using regexes, but you can't implement regexes using scanf. -- Steve
On Wed, 16 Feb 2022 at 10:15, Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote:
scanf just isn't powerful enough. For example, consider parsing user input dates: scanf("%d/%d/%d", &year, &month, &day). This is nice and simple, but handling "2022-02-15" as well requires a bit of thinking and several extra statements in C. In Python, I guess it would probably look something like
    year, sep1, month, sep2, day = scanf("%d%c%d%c%d")
    if not ('/' == sep1 == sep2 or '-' == sep1 == sep2):
        raise DateFormatUnacceptableError
    # range checks for month and day go here
Assuming that scanf raises if there is no match, I would probably go with:
Having scanf raise is one option; another option would be to have it return a partial result, which would raise ValueError when unpacked in this simple way. (Partial results are FAR easier to debug than a simple "didn't match", plus they can be extremely useful in some situations.)
    try:
        # Who writes ISO-8601 dates using slashes?
        day, month, year = scanf("%d/%d/%d")
        if ALLOW_TWO_DIGIT_YEARS and len(year) == 2:
            year = "20" + year
    except ScanError:
        year, month, day = scanf("%d-%d-%d")
It all depends on what your goal is. Do you want to support multiple different formats (d/m/y, y-m-d, etc.)? Do you want one format with multiple options for delimiter? Is it okay if someone mismatches delimiters? Most likely, I'd not care if someone uses y/m-d, but I wouldn't allow d/m/y or m/d/y, so I'd write it like this:

    year, month, day = scanf("%d%*[-/]%d%*[-/]%d")

But realistically, if we're doing actual ISO 8601 date parsing, then *not one of these is correct*, and we should be using an actual ISO 8601 library :)

The simple cases like log file parsing are usually consuming the output of exactly one program, so you can mandate the delimiter completely. Here's something that can parse the output of 'git blame':

    commit, name, y, mo, d, h, mi, s, tz, line, text = \
        scanf("%s (%s %d-%d-%d %d:%d:%d %d %d) %s")

(Of course, you should use --porcelain instead, but this is an example.)

There's a spectrum of needs, and a spectrum of tools that can fulfil them. At one extreme, simple method calls, the "in" operator, etc. - very limited, very fast, easy to read. At the other extreme, full-on language parsers with detailed grammars. In between? Well, sscanf is a bit simpler than regexp, REXX's parse is probably somewhere near sscanf, SNOBOL is probably a bit to the right of regexp, etc., etc. We shouldn't have to stick to a single tool just because it's capable of spanning a wide range.
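Python has no scanf in the standard library, so the calls above are hypothetical. For illustration, here is a toy sketch of how a tiny subset of scanf (just the %d, %s and %*[...] used above) could be emulated on top of re; the function name and behaviour are made up for this example:

    import re

    def scanf(fmt, s):
        # Translate a minimal scanf-style format into a regex, then match it.
        pattern = ""
        i = 0
        while i < len(fmt):
            if fmt.startswith("%d", i):
                pattern += r"([+-]?\d+)"
                i += 2
            elif fmt.startswith("%s", i):
                pattern += r"(\S+)"
                i += 2
            elif fmt.startswith("%*[", i):        # match-but-don't-capture a character class
                j = fmt.index("]", i)
                pattern += "[" + re.escape(fmt[i + 3:j]) + "]"
                i = j + 1
            else:                                 # anything else is a literal
                pattern += re.escape(fmt[i])
                i += 1
        m = re.fullmatch(pattern, s)
        if m is None:
            raise ValueError(f"{s!r} does not match {fmt!r}")
        return [int(g) if g.lstrip("+-").isdigit() else g for g in m.groups()]

    year, month, day = scanf("%d%*[-/]%d%*[-/]%d", "2022-02-15")   # -> 2022, 2, 15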
I think that
year, sep1, month, sep2, day = re.match(r"(\d+)([-/])(\d+)([-/])(\d+)").groups()
might do it (until Tim or Chris tell me that actually is wrong).
Or use \2 as you suggest later on.
Yeah, \2 much more clearly expresses the intent of "take either of these characters, and then match another of that character". ChrisA
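For concreteness, the \2 form in runnable terms (an illustration, not quoted from anyone's mail):

    import re

    m = re.fullmatch(r"(\d+)([-/])(\d+)\2(\d+)", "2022-02-15")
    year, sep, month, day = m.groups()   # sep is "-"; "2022/02-15" would not match at all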
On 2022-02-15 06:05, Tim Peters wrote:
[Steven D'Aprano <steve@pearwood.info>]
I've been interested in the existence of SNOBOL string scanning for a long time, but I know very little about it.
How does it differ from regexes, and why have programming languages pretty much standardised on regexes rather than other forms of string matching?
What we call "regexps" today contain all sorts of things that aren't in the original formal definition of "regular expressions". For example, even the ubiquitous "^" and "$" (start- and end-of-line assertions) go beyond what the phrase formally means.
So the question is ill-defined. When Perl added recursive regular expressions, I'm not sure there's any real difference in theoretical capability remaining. Without that, though, and for example, you can't write a regular expression that matches strings with balanced parentheses ("regexps can't count"), while I earlier posted a simple 2-liner in SNOBOL that implements such a thing (patterns in SNOBOL can freely invoke other patterns, including themselves).
As to why regexps prevailed, traction! They are useful tools, and _started_ life as pretty simple things, with small, elegant, and efficient implementations. Feature creep and "faster! faster! faster!" turned the implementations more into bottomless pits now ;-)
Adoption breeds more adoption in the computer world. They have no real competition anymore. The same sociological illness has also cursed us, e.g., with an eternity of floating point signed zeroes ;-)
Chris didn't say this, but I will: I'm amazed that things much _simpler_ than regexps, like his scanf and REXX PARSE examples, haven't spread more. Simple solutions to simple problems are very appealing to me. Although, to be fair, I get a kick too out of massive overkill ;-)
Regexes were simple to start with, so only a few metacharacters were needed, the remaining characters being treated as literals. As new features were added, the existing metacharacters were used in new ways that had been illegal until then in order to remain backwards-compatible. Add to that that there are multiple implementations with differing (and sometimes only slightly differing) features and behaviours. It's a good example of evolution: often messy, and resulting in clunky designs.
Tim Peters wrote:
""" Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. - Jamie Zawinski """
Maybe so, but I'm committed now :). I have dozens of regexes to parse specific log messages I'm interested in. I made a little DSL that uses regexes with capture groups, and if the regex matches, takes the resulting groupdict and optionally applies further transformations on the individual fields. This allows me to very concisely specify what I want to extract before doing further analysis and aggregation on the resulting fields. For example:

    flush_end = Rule(
        Capture(
            # Completed flushing /u01/data02/tb_tbi_project02_prd/data_launch_index-4a5f72725b7211eaab635720a1b8a299/aa-26507-bti-Data.db (46.528MiB) for commitlog position CommitLogPosition(segmentId=1615955816662, position=223538288)
            # Completed flushing /dse/data02/OpsCenter/rollup_state-7b621931ab7511e8b862810a639403e5/bb-21969-bti-Data.db (7.763MiB/2.197MiB on disk/1 files) for commitlog position CommitLogPosition(segmentId=1637403836277, position=9927158)
            r"Completed flushing (?P<sstable>[^ ]+) \((?P<bytes_flushed>[^)/]+)(/(?P<bytes_on_disk>[^ ]+) on disk/(?P<file_count>[^ ]+) files)?\) for commitlog position CommitLogPosition\(segmentId=(?P<commitlog_segment>[^,]+), position=(?P<commitlog_position>[^)]+)\)"
        ),
        Convert(
            normval,
            "bytes_flushed",
            "bytes_on_disk",
            "commitlog_segment",
            "commitlog_position",
        ),
        table_from_sstable,
    )

I know there are specialized tools like logstash, but it's nice to be able to specify the extraction and subsequent analysis together in Python.
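The Rule/Capture/Convert helpers aren't shown in the thread; in case it helps to picture what the DSL above is doing, here is a minimal sketch of one way they *could* be implemented (pure guesswork about the actual code):

    import re

    class Capture:
        # Compile a regex with named groups; calling it on a line yields the groupdict, or None.
        def __init__(self, pattern):
            self.pattern = re.compile(pattern)
        def __call__(self, line):
            m = self.pattern.search(line)
            return m.groupdict() if m else None

    class Convert:
        # Apply a converter function to the listed fields of an extracted dict (if present).
        def __init__(self, func, *fields):
            self.func, self.fields = func, fields
        def __call__(self, fields):
            for name in self.fields:
                if fields.get(name) is not None:
                    fields[name] = self.func(fields[name])
            return fields

    class Rule:
        # A Capture followed by any number of dict -> dict transformations.
        def __init__(self, capture, *transforms):
            self.capture, self.transforms = capture, transforms
        def __call__(self, line):
            fields = self.capture(line)
            if fields is None:
                return None
            for transform in self.transforms:
                fields = transform(fields)
            return fields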
reason to change that. Naive regexps are both clumsy and prone to bad timing in many tasks that "should be" very easy to express. For example, "now match up to the next occurrence of 'X'". In SNOBOL and Icon, that's trivial. 75% of regexp users will write ".*X", with scant understanding that it may match waaaay more than they intended. Another 20% will write ".*?X", with scant understanding that may extend beyond _just_ "the next" X in some cases. That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start.
If you look at my regex in the example above, you will see that "[^X]*X" is exactly what I did. The pathological case arose from a simple typo: I had an extra + after a capture group that I failed to notice, which somehow worked correctly on the expected input but ran forever when the terminating character appeared more times than anticipated in the input string.
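To make the failure mode concrete, here is a small illustration (not the actual regex from my code) of how a stray + wrapped around a repeating group turns a quick failure into an exponential one:

    import re

    good = re.compile(r"[^;]*;")        # scan up to the next ";" -- linear
    bad = re.compile(r"(?:[^;]+)+;")    # same intent, catastrophic on non-matching input

    text = "a" * 28 + ","               # no ";" anywhere, so both must fail
    good.match(text)                    # returns None immediately
    # bad.match(text)                   # uncomment to watch it churn for a very long time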
Well, I certainly sparked a lot of interesting discussion, which I have quite enjoyed reading. But to bring this thread back around to its original topic, is there support among the Python maintainers for adding a timeout feature to the Python re library? I will look at the third-party regex library that Jonathan suggested but I still believe a timeout option would be a valuable feature to have in the standard library.
On Thu, 17 Feb 2022 at 08:33, J.B. Langston <jblangston@datastax.com> wrote:
Well, I certainly sparked a lot of interesting discussion, which I have quite enjoyed reading. But to bring this thread back around to its original topic, is there support among the Python maintainers for adding a timeout feature to the Python re library? I will look at the third-party regex library that Jonathan suggested but I still believe a timeout option would be a valuable feature to have in the standard library.
I'm not a maintainer, but I'd personally be against a timeout. It would add overhead to common cases in order to put a shield around pathological ones, and it's difficult to impossible to usefully define the cutoff. Instead, I'd recommend trying some of the simpler parsing options, as explored in the ensuing discussion, to see if one of those has better worst-case performance while still being able to do what's needed. (Hence all the discussion of "no-backtracking" options.) ChrisA
[J.B. Langston <jblangston@datastax.com>]
Well, I certainly sparked a lot of interesting discussion, which I have quite enjoyed reading. But to bring this thread back around to its original topic, is there support among the Python maintainers for adding a timeout feature to the Python re library?
Buried in the fun discussion was my guess: no way. Python's re is effectively dead legacy code, with no current "owner". Its commit history shows very little activity for some years already. Most commits are due to generic "code cleanup" crusades that have nothing specific to do with the algorithms. None required non-trivial knowledge of the implementation.

Here's the most recent I found that actually changed behavior:

    """
    commit 6cc8ac949907b9a1c0f73709c6978b7a43e634e3
    Author: Zackery Spytz <zspytz@gmail.com>
    Date:   Fri May 21 14:02:42 2021 -0700

        bpo-40736: Improve the error message for re.search() TypeError (GH-23312)

        Include the invalid type in the error message.
    """

A trivial change.
I will look at the third-party regex library that Jonathan suggested but I still believe a timeout option would be a valuable feature to have in the standard library.
Which is the problem: regex has _dozens_ of features that would be valuable to have in the standard library. regex is in fact one of the best regexp libraries on the planet. It already has timeouts, and other features (like possessive quantifiers) that are actually (unlike timeouts) frequently asked about by many programmers.

In fact regex started life intending to go into core Python, in 2008:

https://bugs.python.org/issue3825

That stretched on and on, and the endless bikeshedding eventually appeared to fizzle out in 2014:

https://bugs.python.org/issue2636

In 2021 a core dev eventually rejected it, as by then MRAB had long since released it as a successful extension module. I assume - but don't know - he got burned out by "the endless bikeshedding" on those issue reports.

In any case, no, no core dev I know of is going to devote their limited time to reproducing a tiny subset of regex's many improvements in Python's legacy engine. In fact, "install regex!" is such an obvious choice at this point that I wouldn't even give time to just reviewing a patch that added timeouts.

BTW, I didn't mention regex in your BPO report because I didn't know at the time it already implemented timeouts. I learned that in this thread.
On 2022-02-16 22:13, Tim Peters wrote:
[J.B. Langston <jblangston@datastax.com>]
Well, I certainly sparked a lot of interesting discussion, which I have quite enjoyed reading. But to bring this thread back around to its original topic, is there support among the Python maintainers for adding a timeout feature to the Python re library?
Buried in the fun discussion was my guess: no way. Python's re is effectively dead legacy code, with no current "owner". Its commit history shows very little activity for some years already. Most commits are due to generic "code cleanup" crusades that have nothing specific to do with the algorithms. None required non-trivial knowledge of the implementation.
Here's the most recent I found that actually changed behavior:
""" commit 6cc8ac949907b9a1c0f73709c6978b7a43e634e3 Author: Zackery Spytz <zspytz@gmail.com> Date: Fri May 21 14:02:42 2021 -0700
bpo-40736: Improve the error message for re.search() TypeError (GH-23312)
Include the invalid type in the error message. """"
A trivial change.
I will look at the third-party regex library that Jonathan suggested but I still believe a timeout option would be a valuable feature to have in the standard library.
Which is the problem: regex has _dozens_ of features that would be valuable to have in the standard library. regex is in fact one of the best regexp libraries on the planet. It already has timeouts, and other features (like possessive quantifiers) that are actually (unlike timeouts) frequently asked about by many programmers.
In fact regex started life intending to go into core Python, in 2008:
https://bugs.python.org/issue3825
That stretched on and on, and the endless bikeshedding eventually appeared to fizzle out in 2014:
https://bugs.python.org/issue2636
In 2021 a core dev eventually rejected it, as by then MRAB had long since released it as a successful extension module. I assume - but don't know - he got burned out by "the endless bikeshedding" on those issue reports.
I eventually decided against having it added to the standard library because that would tie fixes and additions to Python's release cycle, and there's that adage that Python has "batteries included", but not nuclear reactors. PyPI is a better place for it, for those who need more than what the standard re module provides.
In any case, no, no core dev I know of is going to devote their limited time to reproducing a tiny subset of regex's many improvements in Python's legacy engine. In fact, "install regex!" is such an obvious choice at this point that I wouldn't even give time to just reviewing a patch that added timeouts.
BTW, I didn't mention regex in your BPO report because I didn't know at the time it already implemented timeouts. I learned that in this thread.
[MRAB <python@mrabarnett.plus.com>]
I eventually decided against having it added to the standard library because that would tie fixes and additions to Python's release cycle, and there's that adage that Python has "batteries included", but not nuclear reactors. PyPI is a better place for it, for those who need more than what the standard re module provides.
I think it's a puzzle with no good solution. For example, while atomic groups may have been "nuclear reactor" level in 2008, they're ubiquitous now. regexps in industry practice keep evolving, and so does your regex module, but Python's re module is frozen in an ever-receding past. Nobody wants to work on it because, well, "regex already does that! In fact, it's been doing it for 15 years already". Your module is _too_ successful for Python's good ;-)
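For readers who haven't met them, atomic groups and possessive quantifiers tell the engine not to backtrack into something it has already matched, which is exactly the cure for the doomed-match blowups discussed earlier. A small illustration using regex (the stdlib re only gained the same syntax later, in Python 3.11):

    import regex

    text = "a" * 40                        # no ";" anywhere, so any match must fail
    # regex.search(r"(?:[^;]+)+;", text)   # naive nested repeat: exponential time before failing
    regex.search(r"(?>[^;]+);", text)      # atomic group: gives up immediately
    regex.search(r"[^;]++;", text)         # possessive quantifier: same effect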
[J.B. Langston <jblangston@datastax.com> ]
Thanks for the conclusive answer.
Not conclusive - just my opinion. Which is informed, but not infallible ;-)
I will checkout the regex library soon.
You may not realize how easy this is? Just in case: go to a shell and type

    pip install regex

(or, on Windows, "python -m pip install regex" in a DOS box). That's it. You're done.

Now you can use regex. In some cases, you can put "import regex as re" at the top of a module. At worst, replace instances of "re" with "regex". Stay away from the new features, and it's highly compatible with Python's re.
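And, since timeouts were the original request, here's roughly what using them looks like once regex is installed. Per the regex docs the timeout keyword takes seconds; the value and pattern below are arbitrary placeholders:

    import regex

    log_line = "... for commitlog position CommitLogPosition(segmentId=1615955816662, position=223538288)"
    pat = regex.compile(r"segmentId=(?P<segment>\d+), position=(?P<position>\d+)")
    m = pat.search(log_line, timeout=0.005)   # raises TimeoutError if matching takes too long
    if m:
        print(m.groupdict())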
On 17 Feb 2022, at 01:04, Tim Peters <tim.peters@gmail.com> wrote:
[J.B. Langston <jblangston@datastax.com> ]
Thanks for the conclusive answer.
Not conclusive - just my opinion. Which is informed, but not infallible ;-)
I will checkout the regex library soon.
You may not realize how easy this is? Just in case: go to a shell and type
pip install regex
(or, on Windows, "python -m pip install regex" in a DOS box).
That's it. You're done. Now you can use regex. In some cases, you can put "import regex as re" at the top of a module. At worst, replace instances of "re" with "regex". Stay away from the new features, and it's highly compatible with Python's re.
I suspect that, like me, what was meant by "checkout" is reading the docs to understand regex's features. The install is trivial.

Barry
On Wed, Feb 16, 2022 at 07:01:42PM -0600, Tim Peters wrote:
You may not realize how easy this is? Just in case: go to a shell and type
pip install regex
(or, on Windows, "python -m pip install regex" in a DOS box).
That's it. You're done.
Easier said than actually done.

https://discuss.python.org/t/pip-missing-from-my-python-download/13750/1

That's not the only failure mode.

https://duckduckgo.com/?q=pip+install+error
https://www.google.com/search?q=pip+install+error

I'm especially fond of questions from people who have successfully installed some library using pip, and then can't import that library from Python. That's always fun to diagnose. And just in case using pip was too hard for the newbie, don't worry, we can always make it harder by telling them to set up a venv first o_O

If you're not in one of the many workplaces (and schools) with strict rules about not installing unapproved software on company machines (instant firing offence), you are right: just run pip install and you're done. I've worked at some of them.

Pip is a good hammer, and many people see every problem as a nail to be solved with "just use pip" -- but very few of them hang around on user forums to help beginners when "just use pip" fails, as it often does :-(
Now you can use regex. In some cases, you can put "import regex as re" at the top of a module. At worst, replace instances of "re" with "regex". Stay away from the new features, and it's highly compatible with Python's re.
If you're only using the features that re supports, wouldn't it be easier to just use re? -- Steve
[Steven D'Aprano <steve@pearwood.info>]
[skipping FUD about pip]
If the OP has problems with pip they can't easily resolve, I expect they'll say so. In my experience, effective help consists of sticking to the simplest things that can possibly work at first, not burying a questioner under an exhaustive account of everything that could possibly go wrong. That's your job ;-) Note that there's not the slightest reason to expect that the OP here is a newbie; quite the contrary.
Now you can use regex. In some cases, you can put "import regex as re" at the top of a module. At worst, replace instances of "re" with "regex". Stay away from the new features, and it's highly compatible with Python's re.
If you're only using the features that re supports, wouldn't it be easier to just use re?
The OP explicitly doesn't want to stick to the features re supports. Specifically, they want to use the optional timeouts that regex supports but re doesn't. What I wrote here is more elaboration on that: _trying_ this is easier than they might be thinking. They don't have to, e.g., rewrite their regexps, or invoke different function or method names, or worry that they'll get different results. The packages are highly compatible in syntax and semantics and APIs so long as you stick to the things re does. That in no way suggests they _should_ stick to what re does. It's assuring them that _getting started_ is close to trivial. Their current code should continue to work unchanged, apart from just changing "re" to "regex".
I have no problem installing software via pip. This is a large project that has many other dependencies already managed via pip and virtualenv.
And, as it turns out, I already had it installed via some transitive dependency. The bigger task will be testing. Indeed it may be as easy as "import regex as re", but I need to test and make sure my regexes still work as expected and that the performance doesn't take a big hit. Trust, but verify :)
And unfortunately it does appear that my app took almost a 20% performance hit from using regex instead of re. Processing time for a test dataset with 700MB of logs went from 77 seconds with the standard library re to 92 seconds with regex. Profiling confirms that the time spent in the groupdict method went from 3.27% to 9.41% and the time spent in match went from 5.19% to 10.33% of the execution time.

So this means switching to regex is probably a no-go. If hanging regexes become a common occurrence for my app I might decide it's worth the performance hit in the name of safety, but at this point I would rather not.
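For anyone wanting to reproduce that kind of comparison on their own patterns, here is a rough micro-benchmark sketch; the pattern and log line are placeholders, not the real ones:

    import re
    import regex
    import timeit

    pat = r"(?P<word>\w+) (?P<num>\d+)"               # placeholder pattern
    line = "flushed 46528 bytes for commitlog position 223538288"
    compiled = {"re": re.compile(pat), "regex": regex.compile(pat)}

    for name, p in compiled.items():
        t = timeit.timeit(lambda: p.search(line).groupdict(), number=200_000)
        print(f"{name:5s} {t:.3f}s")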
[J.B. Langston <jblangston@datastax.com>]
And unfortunately it does appear that my app took almost a 20% performance hit from using regex instead of re. Processing time for a test dataset with 700MB of logs went from 77 seconds with the standard library re to 92 seconds with regex. Profiling confirms that the time spent in the groupdict method went from 3.27% to 9.41% and the time spent in match went from 5.19% to 10.33% of the execution time.
I'm mildly surprised! Most times people report that regex is at least modestly faster than re. For peak performance at the expense of flexibility (e.g., no backreferences supported), perhaps you'd get a major speed gain from an entirely different regexp engine approach. Google built such an engine, which has worst-case linear matching runtime (but, depending on details, _may_ require space exponential in the regexp to compile a regexp object). A Python binding for that is here ("pip install pyre2"): https://pypi.org/project/pyre2/ I haven't used it - this isn't a particular area of interest for me. Note that in its "Performance" section, it shows examples where it blows re out of the water. But, in all those cases, regex was modestly to very significantly faster than re too. YMMV.
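If that route looks interesting, usage is intended to mirror re; per the pyre2 README the installed module is imported as re2 (I haven't verified this here, so treat it as an assumption):

    import re2 as re   # provided by "pip install pyre2"

    m = re.search(r"position=(\d+)", "commitlog position=223538288")
    print(m.group(1))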
So this means switching to regex is probably a no go.
The difference between 77 and 92 seconds doesn't, on the face of it, scream "disaster" to me - but suit yourself.
If hanging regexes become a common occurrence for my app I might decide it's worth the performance hit in the name of safety, but at this point I would rather not.
If they were destined to become a common occurrence, they already would have done so. You blamed your bad case on unexpected data, but the _actual_ cause was an unintended typo in one of your regexps. So it goes. You're using a tool with a hyper-concise notation, where a correct expression is pretty much indistinguishable from line noise, and a typo is rarely detectable as a syntax error. So you're learning the hard way that you have to be on crisis-level alert when writing regexps: they're extremely touchy and unforgiving. pyre2 would spare you from all match-time timing disasters, but "touchy and unforgiving" applies all the same. Instead of a typo causing exponential runtime, it may instead cause the regexp to match (or fail to match) in unintended ways. About which no clue of any kind will be left behind, unless you stare at the inputs and outputs and check them yourself. But, in that case, 92 seconds wouldn't even get you through 1000 bytes ;-)
participants (14)
- Barry
- Ben Rudiak-Gould
- Bruce Leban
- Chris Angelico
- David Mertz, Ph.D.
- J.B. Langston
- Jonathan Slenders
- MRAB
- Nick Timkovich
- Paul Moore
- Ricky Teachey
- Stephen J. Turnbull
- Steven D'Aprano
- Tim Peters