Cancel or timeout a long running regular expression
tjreedy at udel.edu
Sat Sep 17 00:01:27 CEST 2011
On 9/16/2011 9:57 AM, Nobody wrote:
>>I wonder if there are any
>> heuristics for detecting exponential time re's.
> Exponential growth results from non-determinism, i.e. if there are
> multiple transitions for a given character from a given state.
> Common patterns include:
> with a choice between matching the "a" at the start of the bracketed
> pattern or the "a" following it.
> with a choice between branches which cannot be resolved until more data
> has been read.
> For re.search (as opposed to re.match):
> When axxx has been read, a following "a" gives a choice between
> continuing the existing match with the second "a", or aborting and
> matching against the first "a". For patterns which contain many copies of
> the initial character, each copy creates another branch.
> Also, using back-references in a regexp [sic] can be a significant
> performance killer, as it rules out the use of a DFA (which, IIRC, Python
> doesn't do anyhow) and can require brute-forcing many combinations. A
> particularly bad case is:
> matching against "aaaaaaaa...".
> If the performance issue is with re.match/re.search rather than with
> re.compile, one option is to use ctypes to access libc's regexp functions.
> Those are likely to provide better searching throughput at the expense of
> potentially increased compilation time.
Now, can you write that as a heuristic *algorithm* def
Terry Jan Reedy
More information about the Python-list