[ python-Bugs-1662581 ] the re module can perform poorly: O(2**n) versus O(n**2)

SourceForge.net noreply at sourceforge.net
Thu Feb 22 23:30:49 CET 2007

Bugs item #1662581, was opened at 2007-02-17 15:39
Message generated for change (Comment added) made by greg
You can respond by visiting: 

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Performance
>Group: Feature Request
Status: Open
Resolution: None
>Priority: 3
Private: No
Submitted By: Gregory P. Smith (greg)
Assigned to: Nobody/Anonymous (nobody)
Summary: the re module can perform poorly: O(2**n) versus O(n**2)

Initial Comment:
in short, the re module can degenerate to really really horrid performance.  See this for how and why:


exponential decline instead of squared.

I don't have a patch so i'm filing this bug as a starting point for future work.  The Modules/_sre.c files implementation could be updated to use the parallel stepping Thompson approach instead of recursive backtracking.

filing this as a bug until me or someone else comes up with a patch.


>Comment By: Gregory P. Smith (greg)
Date: 2007-02-22 14:30

Logged In: YES 
Originator: YES

yeah this is better as a feature request.  certianly low priority either

-nothing- I propose doing would change the syntax or behaviour of existing
regular expressions at all.  Doing so would be a disaster.  thompson nfa
does not imply changing the behaviour.

anyways its a lot more than a simple "patch" to change the re module to
not use backtracking so i expect this to languish unless someone has a of
free time and motivation all at once. :)


Comment By: Josiah Carlson (josiahcarlson)
Date: 2007-02-22 00:51

Logged In: YES 
Originator: NO

I would file this under "feature request"; the current situation isn't so
much buggy, as slow.  While you can produce a segfault with the current
regular expression engine (due to stack overflow), you can do the same
thing with regular Python on Linux (with sys.setrecursionlimit), ctypes,
etc., and none of those are considered as buggy.

My only concern with such a change is that it may or may not change the
semantics of the repeat operators '*' and '+', which are currently defined
as "greedy".  If I skimmed the article correctly late at night, switching
to a Thompson family regular expression engine may result in those
operators no longer being greedy.  Please correct me if I am wrong.


You can respond by visiting: 

More information about the Python-bugs-list mailing list