[Spambayes] Mail with problem

Thu Nov 14 20:01:50 2002

[Tim Stone]
> Depending on what kind of regex engine python has (NFA or DFA)
> and on how the html parsing regex is implemented relative to its
> engine, it can take an enormous amount of memory.

That isn't the problem here.  There are no "runaway" regexps.  The problem
is that minimal matching is implemented via recursive call, one level per
character matched.  That's a long-standing problem.  All minimal matches in
the tokenizer regexps are bounded, but it's impossible to guess an upper
limit that's "safe" across every half-a-brain C implementation in existence.

> For example, with an NFA and a regex that uses alternation in certain
ways,
> the stack can grow exponentially.

Yes, but that's not the case here.

> We may want to take a hard look at tokenizer's html parsing
> regex.  I looked  at it briefly yesterday, but didn't pay much attention.
>
> Tim, do you know if the python regex is NFA or DFA?

NFA.

> If it's NFA, is there a DFA engine we can plug in?

No.  With great pain, the regexp in question could be rewritten to avoid
minimal matches.  I'd rather the OP convince his OS to let him use some of
the dozens of megabytes sitting idle on his machine <wink>.