[Python-Dev] Some questions about maintenance of the regular expression code.

Gary Herron gherron@islandtraining.com
Wed, 26 Feb 2003 00:22:15 -0800

I've decided to answer Guido's call for someone to take over
maintenance of the SRE code since it has started to fall into
disrepair.  First a short introduction and then on with a question
that begs for some discussion on this list.

My name is Gary Herron.  I've been using Python whenever possible for
about 8 years (and for most of the last year and a half I've been able
to choose Python almost exclusively -- lucky me).  I've mostly lurked
around the python and python-dev lists, only occasionally offering
help or comments.  Volunteering to maintain the SRE code seems like a
good opportunity to jump in and do something useful.

Now on with the questions at hand:

The first glance at the regular expression bug list and the _sre.c
code results in the observation that several of the bugs are related
to running over the recursion limit.  The problem comes from using a
pattern containing ".*?" in a situation where it is expected to match
many thousands of characters.  Each character matched by ".*?" causes
one level or recursion, quickly overflowing the recursion limit.

Question 1:  Should we even consider these as bugs?

  After all the recursion limit is in place to prevent badly used re's
  from crashing Python with a stack overflow. We could claim the kinds
  of patterns which cause heavy recursion are miss-uses of regular
  expressions which are bound to fail when used on long strings.  If
  we take this route, something should be added to the documentation
  which explains when excessive recursion is likely to bite.

Question 2: If we want to solve the problem (instead of just dodging
it) how should we proceed?

 * Increasing the limit beyond the current 10000 is not really an
   option for two reasons:

    1. This doesn't solve the problem.  One can always match on a
       string purposely chosen to be long enough to overflow any
       recursion limit.

    2. A recent patch (browse "cvs log _sre.c" to find a reference)
       actually lowered the limit from 10000 to 7500 for certain
       64-bit machines which apparently suffered a stack overflow
       before hitting 10000 recursion levels.

 * An attempt to replace the hard-coded upper limit with a programmed
   check of the stack space (see Misc/HISTORY for a reference to
   PyOS_CheckStack) was added and then withdrawn for version 2.0.
   Does anybody know the history of this?  This would not really solve
   the problem (especially on the 64 bit machines which could not even
   hit 10000 levels of recursion), but it would push the recursion
   limit to its highest possible value rather than some arbitrary
   hard-coded value.

 * Removing the recursion by the standard method of storing state in a
   program managed stack and looping rather than recursing would push
   the storage problem from the stack into the (probably much larger)
   heap. I haven't looked at the code enough to judge if this is
   feasible, but if it is, some limit would still remain. It would,
   however, depend on available memory rather than stack space.  And
   still, the documentation should warn that certain naive pattens on
   LONG strings could fail after wasting much time chewing through all
   available memory.

 * I notice that, unlike pattern ".*?", matching to pattern ".*" does
   not recurse for each character matched.  With only a few minutes of
   looking at the code, I can't begin to guess if it is feasible to
   make the former work like the later without recursing.

Any comments?  Remember that all the points under question 2 are worth
considering only if we decide we really ought to support things like
patterns using ".*?"  to match many thousands of characters.

Gary Herron