[Python-Dev] Some questions about maintenance of the regular expression code.
Gary Herron
gherron@islandtraining.com
Wed, 26 Feb 2003 00:22:15 -0800
I've decided to answer Guido's call for someone to take over
maintenance of the SRE code since it has started to fall into
disrepair. First a short introduction and then on with a question
that begs for some discussion on this list.
My name is Gary Herron. I've been using Python whenever possible for
about 8 years (and for most of the last year and a half I've been able
to choose Python almost exclusively -- lucky me). I've mostly lurked
around the python and python-dev lists, only occasionally offering
help or comments. Volunteering to maintain the SRE code seems like a
good opportunity to jump in and do something useful.
Now on with the questions at hand:
The first glance at the regular expression bug list and the _sre.c
code results in the observation that several of the bugs are related
to running over the recursion limit. The problem comes from using a
pattern containing ".*?" in a situation where it is expected to match
many thousands of characters. Each character matched by ".*?" causes
one level or recursion, quickly overflowing the recursion limit.
Question 1: Should we even consider these as bugs?
After all the recursion limit is in place to prevent badly used re's
from crashing Python with a stack overflow. We could claim the kinds
of patterns which cause heavy recursion are miss-uses of regular
expressions which are bound to fail when used on long strings. If
we take this route, something should be added to the documentation
which explains when excessive recursion is likely to bite.
Question 2: If we want to solve the problem (instead of just dodging
it) how should we proceed?
* Increasing the limit beyond the current 10000 is not really an
option for two reasons:
1. This doesn't solve the problem. One can always match on a
string purposely chosen to be long enough to overflow any
recursion limit.
2. A recent patch (browse "cvs log _sre.c" to find a reference)
actually lowered the limit from 10000 to 7500 for certain
64-bit machines which apparently suffered a stack overflow
before hitting 10000 recursion levels.
* An attempt to replace the hard-coded upper limit with a programmed
check of the stack space (see Misc/HISTORY for a reference to
PyOS_CheckStack) was added and then withdrawn for version 2.0.
Does anybody know the history of this? This would not really solve
the problem (especially on the 64 bit machines which could not even
hit 10000 levels of recursion), but it would push the recursion
limit to its highest possible value rather than some arbitrary
hard-coded value.
* Removing the recursion by the standard method of storing state in a
program managed stack and looping rather than recursing would push
the storage problem from the stack into the (probably much larger)
heap. I haven't looked at the code enough to judge if this is
feasible, but if it is, some limit would still remain. It would,
however, depend on available memory rather than stack space. And
still, the documentation should warn that certain naive pattens on
LONG strings could fail after wasting much time chewing through all
available memory.
* I notice that, unlike pattern ".*?", matching to pattern ".*" does
not recurse for each character matched. With only a few minutes of
looking at the code, I can't begin to guess if it is feasible to
make the former work like the later without recursing.
Any comments? Remember that all the points under question 2 are worth
considering only if we decide we really ought to support things like
patterns using ".*?" to match many thousands of characters.
Thanks,
Gary Herron