[Spambayes] Mail with problem

Tim Stone - Four Stones Expressions tim@fourstonesExpressions.com
Thu Nov 14 19:48:36 2002

Depending on what kind of regex engine Python has (NFA or DFA) and on how the 
HTML-parsing regex is written relative to that engine, matching can take an 
enormous amount of memory.  For example, with a backtracking NFA engine and a 
regex that uses alternation in certain ways, the number of paths the engine 
has to try can grow exponentially, and its stack grows along with the input.
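
To make the alternation point concrete, here is a toy sketch of a
backtracking matcher hard-wired to the pattern (a|aa)+b.  It is not Python's
regex engine and has nothing to do with the code in tokenizer.py; the
function name and the pattern are made up for illustration.  All it does is
count how often the matcher has to pick up again after one of the two
alternatives has matched.

```python
def backtracking_steps(text):
    """Match the hypothetical pattern (a|aa)+b against all of `text`.

    Returns (matched, steps), where `steps` counts how many times the
    matcher resumes after one alternative of (a|aa) has matched.
    """
    steps = 0

    def after_iteration(pos):
        # At least one (a|aa) has matched.  Try another iteration first
        # (greedy), and only then try to finish with the final 'b'.
        nonlocal steps
        steps += 1
        if iteration(pos):
            return True
        return text[pos:] == "b"

    def iteration(pos):
        # Try each alternative in order; fall through (backtrack) on failure.
        for alt in ("a", "aa"):
            if text.startswith(alt, pos) and after_iteration(pos + len(alt)):
                return True
        return False

    matched = iteration(0)
    return matched, steps


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, backtracking_steps("a" * n))
```

On a run of 'a's with no trailing 'b' the match has to fail, and the step
count follows the Fibonacci numbers as the run gets longer, while the
recursion depth of this toy stays proportional to the length of the input.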

We may want to take a hard look at tokenizer's html parsing regex.  I looked 
at it briefly yesterday, but didn't pay much attention.

Tim, do you know if Python's regex engine is NFA or DFA?  If it's NFA, is 
there a DFA engine we can plug in?

- TimS

11/14/2002 1:24:58 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Francois Granger]
>> The enclosed file contains a mail which, when received or trained through
>> pop3proxy, gives me the following error:
>> (MacOS 9.1, 24 MB memory for Python 2.2.1)
>> ...
>> [HD:Dev:spambayes:tokenizer.py|tokenize_body|1254])
>Looks like the regular expression engine runs out of (C) stack space while
>trying to find HTML tags to strip.  I don't know enough about Macs to
>suggest something specific, but in general you have to do whatever it takes
>to convince the OS to give the program more stack space to work with.
>Short of that, reducing the instances of "2048" in html_re in tokenizer.py
>should make the problem go away, but since C stack space limits are
>platform-specific, it's impossible to say how small "is safe" for you
>without simply trying it over and over until the error goes away.
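
As a rough sketch of the "reduce the bound" suggestion above: the snippet
below builds a tag-stripping regex whose repetition limit is a parameter, so
it can be lowered step by step on a platform with a small C stack.  The
pattern is a simplified stand-in, not the actual html_re from tokenizer.py,
and make_tag_stripper and max_tag_len are hypothetical names.

```python
import re


def make_tag_stripper(max_tag_len=2048):
    """Return a function that replaces short HTML-like tags with a space.

    Assumption: a "tag" is '<', then at most `max_tag_len` characters
    other than '>', then '>'.  This is a simplified stand-in pattern.
    """
    tag_re = re.compile(r"<[^>]{0,%d}>" % max_tag_len)

    def strip_tags(text):
        return tag_re.sub(" ", text)

    return strip_tags


if __name__ == "__main__":
    strip_tags = make_tag_stripper(max_tag_len=256)
    print(strip_tags("<b>cheap</b> pills"))
```

Anything between '<' and '>' that is longer than the bound is simply left in
the text, so a smaller bound trades some stripping for less work per match.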
