Some questions about maintenance of the regular expression code.
I've decided to answer Guido's call for someone to take over maintenance of the SRE code since it has started to fall into disrepair. First a short introduction and then on with a question that begs for some discussion on this list. My name is Gary Herron. I've been using Python whenever possible for about 8 years (and for most of the last year and a half I've been able to choose Python almost exclusively -- lucky me). I've mostly lurked around the python and python-dev lists, only occasionally offering help or comments. Volunteering to maintain the SRE code seems like a good opportunity to jump in and do something useful. Now on with the questions at hand: The first glance at the regular expression bug list and the _sre.c code results in the observation that several of the bugs are related to running over the recursion limit. The problem comes from using a pattern containing ".*?" in a situation where it is expected to match many thousands of characters. Each character matched by ".*?" causes one level or recursion, quickly overflowing the recursion limit. Question 1: Should we even consider these as bugs? After all the recursion limit is in place to prevent badly used re's from crashing Python with a stack overflow. We could claim the kinds of patterns which cause heavy recursion are miss-uses of regular expressions which are bound to fail when used on long strings. If we take this route, something should be added to the documentation which explains when excessive recursion is likely to bite. Question 2: If we want to solve the problem (instead of just dodging it) how should we proceed? * Increasing the limit beyond the current 10000 is not really an option for two reasons: 1. This doesn't solve the problem. One can always match on a string purposely chosen to be long enough to overflow any recursion limit. 2. A recent patch (browse "cvs log _sre.c" to find a reference) actually lowered the limit from 10000 to 7500 for certain 64-bit machines which apparently suffered a stack overflow before hitting 10000 recursion levels. * An attempt to replace the hard-coded upper limit with a programmed check of the stack space (see Misc/HISTORY for a reference to PyOS_CheckStack) was added and then withdrawn for version 2.0. Does anybody know the history of this? This would not really solve the problem (especially on the 64 bit machines which could not even hit 10000 levels of recursion), but it would push the recursion limit to its highest possible value rather than some arbitrary hard-coded value. * Removing the recursion by the standard method of storing state in a program managed stack and looping rather than recursing would push the storage problem from the stack into the (probably much larger) heap. I haven't looked at the code enough to judge if this is feasible, but if it is, some limit would still remain. It would, however, depend on available memory rather than stack space. And still, the documentation should warn that certain naive pattens on LONG strings could fail after wasting much time chewing through all available memory. * I notice that, unlike pattern ".*?", matching to pattern ".*" does not recurse for each character matched. With only a few minutes of looking at the code, I can't begin to guess if it is feasible to make the former work like the later without recursing. Any comments? Remember that all the points under question 2 are worth considering only if we decide we really ought to support things like patterns using ".*?" to match many thousands of characters. Thanks, Gary Herron
Gary Herron wrote:
The first glance at the regular expression bug list and the _sre.c code results in the observation that several of the bugs are related to running over the recursion limit. The problem comes from using a pattern containing ".*?" in a situation where it is expected to match many thousands of characters. Each character matched by ".*?" causes one level or recursion, quickly overflowing the recursion limit.
Wouldn't it be possible for the RE compiler to issue a warning in case these kind of patterns are used ? This would be much more helpful than trying to work-around the user problem. -- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Feb 26 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
Python UK 2003, Oxford: 34 days left EuroPython 2003, Charleroi, Belgium: 118 days left
On Wednesday 26 February 2003 01:08 am, M.-A. Lemburg wrote:
Gary Herron wrote:
The first glance at the regular expression bug list and the _sre.c code results in the observation that several of the bugs are related to running over the recursion limit. The problem comes from using a pattern containing ".*?" in a situation where it is expected to match many thousands of characters. Each character matched by ".*?" causes one level or recursion, quickly overflowing the recursion limit.
Wouldn't it be possible for the RE compiler to issue a warning in case these kind of patterns are used ? This would be much more helpful than trying to work-around the user problem.
I think not. It's not the pattern that's the problem. A pattern containing ".*?" is perfectly legitimate and useful. The problem arises when the pattern is used on a string which has thousands of characters which match. By that point the RE compiler is right out of the picture. Gary Herron
Gary Herron wrote:
On Wednesday 26 February 2003 01:08 am, M.-A. Lemburg wrote:
Gary Herron wrote:
The first glance at the regular expression bug list and the _sre.c code results in the observation that several of the bugs are related to running over the recursion limit. The problem comes from using a pattern containing ".*?" in a situation where it is expected to match many thousands of characters. Each character matched by ".*?" causes one level or recursion, quickly overflowing the recursion limit.
Wouldn't it be possible for the RE compiler to issue a warning in case these kind of patterns are used ? This would be much more helpful than trying to work-around the user problem.
I think not. It's not the pattern that's the problem. A pattern containing ".*?" is perfectly legitimate and useful.
Hmm, could you explain where ".*?" is useful ?
The problem arises when the pattern is used on a string which has thousands of characters which match. By that point the RE compiler is right out of the picture.
-- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Feb 26 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
Python UK 2003, Oxford: 34 days left EuroPython 2003, Charleroi, Belgium: 118 days left
On Wednesday 26 February 2003 10:23 am, M.-A. Lemburg wrote:
Gary Herron wrote:
On Wednesday 26 February 2003 01:08 am, M.-A. Lemburg wrote:
Gary Herron wrote:
The first glance at the regular expression bug list and the _sre.c code results in the observation that several of the bugs are related to running over the recursion limit. The problem comes from using a pattern containing ".*?" in a situation where it is expected to match many thousands of characters. Each character matched by ".*?" causes one level or recursion, quickly overflowing the recursion limit.
Wouldn't it be possible for the RE compiler to issue a warning in case these kind of patterns are used ? This would be much more helpful than trying to work-around the user problem.
I think not. It's not the pattern that's the problem. A pattern containing ".*?" is perfectly legitimate and useful.
Hmm, could you explain where ".*?" is useful ?
Yes, easily. It's the non-greedy version of "match all". The manual page for the re module has this nice example: *?, +?, ?? The "*", "+", and "?" qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
The problem arises when the pattern is used on a string which has thousands of characters which match. By that point the RE compiler is right out of the picture.
M.-A. Lemburg asked:
Hmm, could you explain where ".*?" is useful ?
Gary Herron responded:
Yes, easily. It's the non-greedy version of "match all". The manual page for the re module has this nice example:
I think the problem is that .*? is too similar to (.*)? -- especially for people who aren't familiar with the non-greedy versions of the operators. This certainly stumped me as well. I'm not sure what can be done about that, though. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org> PythonLabs at Zope Corporation
On Wednesday 26 February 2003 10:39 am, Fred L. Drake, Jr. wrote:
M.-A. Lemburg asked:
Hmm, could you explain where ".*?" is useful ?
Gary Herron responded:
Yes, easily. It's the non-greedy version of "match all". The manual page for the re module has this nice example:
I think the problem is that .*? is too similar to (.*)? -- especially for people who aren't familiar with the non-greedy versions of the operators. This certainly stumped me as well.
I'm not sure what can be done about that, though.
It stumped my for a while also, even though I've used them on occasion. In general I find regular expressions mostly unreadable. Solutions for that problem would occupy a (probably long and hopefully different) thread.
[Gary Herron]
It stumped my for a while also, even though I've used them on occasion.
The '|' has always thrown me; always want it to bind more tightly.
In general I find regular expressions mostly unreadable.
Ditto.
Solutions for that problem would occupy a (probably long and hopefully different) thread.
Is there enough interest to actually put the time and effort into attempting to develop new regex syntax? There is obviously always the option to follow Perl 6's suggested new syntax. We could also come up with a more Pythonic solution. I suspect this would be big enough to require its own SIG or something. Ping has a module at http://lfw.org/python/rxb15.py that creates regexes patterns using a much more straight-forward syntax by using actual words to represent things; there are functions like 'maybe', 'exactly', etc. (don't know how powerful it is, though; he just showed it to me briefly two weeks ago). -Brett
On Wed, 26 Feb 2003, Brett Cannon wrote:
Is there enough interest to actually put the time and effort into attempting to develop new regex syntax?
I would be excited about this. I've long dreamed of a more human-friendly way to do regular expressions. Let's see if there's interest here, and if there is enough, perhaps we can discuss the details on a separate list (no point in having more syntax arguments on python-dev until there's a concrete proposal, i think).
Ping has a module at http://lfw.org/python/rxb15.py that creates regexes patterns using a much more straight-forward syntax by using actual words to represent things; there are functions like 'maybe', 'exactly', etc. (don't know how powerful it is, though; he just showed it to me briefly two weeks ago).
It can do most things that i use regular expressions for. I used it regularly for a while, but it's somewhat slow. -- ?!ng
On Wednesday 26 February 2003 12:50 pm, Ka-Ping Yee wrote:
On Wed, 26 Feb 2003, Brett Cannon wrote:
Is there enough interest to actually put the time and effort into attempting to develop new regex syntax?
Interesting idea, but lets get back to topic at hand, that being an attempt to *reduce* the number of regular expression bugs not *increase* their count. :-) What to do about RE's that hit the recursion limit? Gary Herron
Gary Herron wrote:
What to do about RE's that hit the recursion limit?
I haven't had a chance to look at the SRE code in detail yet, but expect that the problem is that the straightforward way to implement .*? is with a C function call. PCRE also implemented it with a function call; Python avoided the limit by forking PCRE and adding a heap-allocated stack and pushing states onto the stack instead of recursing, which is why pre avoids the recursion limit (well, until you fill up your entire heap). The problem is that the patches to do this were extensive and ugly. Maybe just fixing the problem for .* will be sufficient. Incidentally, before we go making lots of changes, has anyone actually pinged Fredrik about the SRE code? His usual approach with software is to process patches and TODO items every N months; maybe he just hasn't gotten around to looking at the SRE backlog? --amk (www.amk.ca) KENT: Thou whoreson zed! thou unnecessary letter! -- _King Lear_, II, ii
On Wednesday 26 February 2003 01:57 pm, A.M. Kuchling wrote:
Gary Herron wrote:
What to do about RE's that hit the recursion limit?
I haven't had a chance to look at the SRE code in detail yet, but expect that the problem is that the straightforward way to implement .*? is with a C function call. PCRE also implemented it with a function call; Python avoided the limit by forking PCRE and adding a heap-allocated stack and pushing states onto the stack instead of recursing, which is why pre avoids the recursion limit (well, until you fill up your entire heap).
The problem is that the patches to do this were extensive and ugly. Maybe just fixing the problem for .* will be sufficient.
Actually .* does not seem to have this recursion problem, only .*? -- I don't yet know why.
Incidentally, before we go making lots of changes, has anyone actually pinged Fredrik about the SRE code? His usual approach with software is to process patches and TODO items every N months; maybe he just hasn't gotten around to looking at the SRE backlog?
I sent a copy of my original post to him. I've had no response yet, but it's been less than a day. Gary Herron
Gary Herron wrote:
I wrote:
Maybe just fixing the problem for .* will be sufficient.
Sorry, brain error on my part: I meant fixing it for .*?. .* doesn't have a problem because the engine can chew up as many matches for '.' as possible, try the rest of the pattern, and then back up for '.'. In other words: .* requires stack space proportional to the size of the regex pattern, which is OK; .*?, as currently implemented, requires stack space proportional to the size of *the string being matched*, which is what causes the problem. --amk (www.amk.ca) MIRANDA: Your tale, sir, would cure deafness. -- _The Tempest_, I, ii
Is there enough interest to actually put the time and effort into attempting to develop new regex syntax?
I suggest dropping this subject before it takes over all available bandwidth in python-dev. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)
[Guido van Rossum]
Is there enough interest to actually put the time and effort into attempting to develop new regex syntax?
I suggest dropping this subject before it takes over all available bandwidth in python-dev. :-)
=) Agreed. If anyone wants to continue discussing this, just email me personally and we can take this off-list and gauge the interest level. -Brett
Brett Cannon wrote:
In general I find regular expressions mostly unreadable.
Ditto.
I'd say both regular expressions and the corresponding python code can be pretty incomprehensible. Regexes because they're too dense and complex, python code becuase it's too lengthy to describe what parts of a string you're after. I've stumbled upon both 100 char regexes and 25 line python string manipulation snippets - both variants taking a while to figure out what they do. ...but I guess the design of a new, better, string matching sublanguage wasn't the topic here? or am I mistaken? Erik
...but I guess the design of a new, better, string matching sublanguage wasn't the topic here?
Not unless you want to change Python into Perl. --Guido van Rossum (home page: http://www.python.org/~guido/)
Gary Herron wrote:
On Wednesday 26 February 2003 10:23 am, M.-A. Lemburg wrote:
The first glance at the regular expression bug list and the _sre.c code results in the observation that several of the bugs are related to running over the recursion limit. The problem comes from using a pattern containing ".*?" in a situation where it is expected to match many thousands of characters. Each character matched by ".*?" causes one level or recursion, quickly overflowing the recursion limit.
Wouldn't it be possible for the RE compiler to issue a warning in case these kind of patterns are used ? This would be much more helpful than trying to work-around the user problem.
I think not. It's not the pattern that's the problem. A pattern containing ".*?" is perfectly legitimate and useful.
Hmm, could you explain where ".*?" is useful ?
Yes, easily. It's the non-greedy version of "match all". The manual page for the re module has this nice example:
*?, +?, ?? The "*", "+", and "?" qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?" after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
Ah, ok. I usually write "<[^>]+>" for these things, if at all... I tend to use mxTextTools for parsing :-)
The problem arises when the pattern is used on a string which has thousands of characters which match. By that point the RE compiler is right out of the picture.
-- Marc-Andre Lemburg eGenix.com Professional Python Software directly from the Source (#1, Feb 26 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
Python UK 2003, Oxford: 34 days left EuroPython 2003, Charleroi, Belgium: 118 days left
M.-A. Lemburg wrote:
Hmm, could you explain where ".*?" is useful ?
The most common example comes from writing a regex to match bits of HTML. If you have a pattern like this: <a href="([^"]*)"> (.*) </a> and your input text is <a href="a">blah</a> <a href="b">blah</a>, group 2 in the pattern will match 'blah</a> <a href="b">blah', which is not the link text that you want. If you write .*?, then group 2 matches just "blah", which is more useful. --amk (www.amk.ca) Ominous, isn't it? -- The Doctor, in "The Awakening"
An irony is that *? is almost always used like so: .*? some_literal_string where the user *intends* "search right for the first appearance of some_literal_string, and chew up the characters until then". If that's what .*? actually meant (it doesn't), it could be implemented very efficiently without recursion. Instead, regexp engines work supernaturally hard to implement something few users understand, and very few actually want. I love regexps <wink>.
Greg Chapman may have provided an effective end to this thread. He sent me (apparently via private email since I haven't seen it on this list yet) a patch which removes the recursion in simple uses of .*?. The code is patterened after the way simple uses of .* are currently handled without recursion. Still complex uses of either minimizing or maximizing repeats will recurse. For example (a|b)*c and (a|b)*?c will both recurse as they chew through a string of a's and b's. I don't consider this a bug. I'll test his patch tonight. Thank you Greg, Gary Herron
participants (9)
-
A.M. Kuchling
-
Brett Cannon
-
erik heneryd
-
Fred L. Drake, Jr.
-
Gary Herron
-
Guido van Rossum
-
Ka-Ping Yee
-
M.-A. Lemburg
-
Tim Peters