[Python-Dev] Some questions about maintenance of the regular expression code.

Gary Herron gherron@islandtraining.com
Wed, 26 Feb 2003 10:34:11 -0800


On Wednesday 26 February 2003 10:23 am, M.-A. Lemburg wrote:
> Gary Herron wrote:
> > On Wednesday 26 February 2003 01:08 am, M.-A. Lemburg wrote:
> >>Gary Herron wrote:
> >>>The first glance at the regular expression bug list and the _sre.c
> >>>code results in the observation that several of the bugs are related
> >>>to running over the recursion limit.  The problem comes from using a
> >>>pattern containing ".*?" in a situation where it is expected to match
> >>>many thousands of characters.  Each character matched by ".*?" causes
> >>>one level or recursion, quickly overflowing the recursion limit.
> >>
> >>Wouldn't it be possible for the RE compiler to issue a warning in
> >>case these kind of patterns are used ? This would be much more helpful
> >>than trying to work-around the user problem.
> >
> > I think not.  It's not the pattern that's the problem.  A pattern
> > containing ".*?" is perfectly legitimate and useful.
>
> Hmm, could you explain where ".*?" is useful ?

Yes, easily.  It's the non-greedy version of "match all".  The manual
page for the re module has this nice example:

*?, +?, ?? 
  The "*", "+", and "?" qualifiers are all greedy; they match as much
  text as possible. Sometimes this behaviour isn't desired; if the RE
  <.*> is matched against '<H1>title</H1>', it will match the entire
  string, and not just '<H1>'. Adding "?" after the qualifier makes it
  perform the match in non-greedy or minimal fashion; as few
  characters as possible will be matched. Using .*? in the previous
  expression will match only '<H1>'.

> > The problem
> > arises when the pattern is used on a string which has thousands of
> > characters which match.  By that point the RE compiler is right out of
> > the picture.