Cancel or timeout a long running regular expression

Nobody nobody at nowhere.com
Fri Sep 16 15:57:06 CEST 2011


On Thu, 15 Sep 2011 14:54:57 -0400, Terry Reedy wrote:

>> I was thinking there might be a technique I could use to evaluate
>> regular expressions in a thread or another process launched via
>> multiprocessing module and then kill the thread/process after a
>> specified timeout period.
> 
> Only solution I remember ever seen posted. I wonder if there are any 
> heuristics for detecting exponential time re's.

Exponential growth results from non-determinism, i.e. if there are
multiple transitions for a given character from a given state.

Common patterns include:

	...(a...)?a...
	...(a...)*a...
	...(a...)+a...

with a choice between matching the "a" at the start of the bracketed
pattern or the "a" following it.

	(xxxa...|xxxb...|xxxc...)

with a choice between branches which cannot be resolved until more data
has been read.

For re.search (as opposed to re.match):

	axxxa...

When axxx has been read, a following "a" gives a choice between
continuing the existing match with the second "a", or aborting and
matching against the first "a". For patterns which contain many copies of
the initial character, each copy creates another branch.

Also, using back-references in a regexp [sic] can be a significant
performance killer, as it rules out the use of a DFA (which, IIRC, Python
doesn't do anyhow) and can require brute-forcing many combinations. A
particularly bad case is:

	(a*)\1

matching against "aaaaaaaa...".

If the performance issue is with re.match/re.search rather than with
re.compile, one option is to use ctypes to access libc's regexp functions.
Those are likely to provide better searching throughput at the expense of
potentially increased compilation time.




More information about the Python-list mailing list