How to get the "longest possible" match with Python's RE module?
Bryan Olson
fakeaddress at nowhere.org
Wed Sep 13 00:12:16 EDT 2006
Tim Peters wrote:
> [...] The most valuable general technique he (eventually ;-)
> explained he called "unrolling", and consists of writing a regexp in
> the form:
>
> normal* (?: special normal* )*
>
> where the sets of characters with which `normal` and `special` can
> start are disjoint. Under most "NFA" implementation, such a regexp is
> immune to most bad behaviors, whether finding very long matches, or in
> searching a very long string that happens not to contain any matches.
[...]
> As a result, you can use it with reasonably high confidence. For
> /why/ that's so, read the rest of his book ;-) In effect, it gives
> the search engine only one possible course of action at each character
> in the target string. So long as that's "normal", it has to stay in
> the "normal" part of the regexp, and as soon as it's "special" it has
> to leave the "normal" part. The intent is to eliminate any
> possibility for backtracking, thus ensuring predictable runtime and
> ensuring that no form of internal backtracking stack needs to be used
> (let alone hit its limit).
So how general is that? The books just says "certain common
expressions". I think the real answer is that one can unroll
exactly the true regular expressions. The unrolling is
essentially resolving the states of a DFA by hand.
--
--Bryan
More information about the Python-list
mailing list