[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Tue, 20 Aug 2002 20:52:52 -0400

(Copied to Paul Graham.  Paul, this is the mailing list of the Python
maintainers.  I thought you'd find the bits about lexical analysis in
bogofilter interesting.  Pythonistas, Paul is one of the smartest and
sanest people in the LISP community, as evidenced partly by the fact
that he hasn't been too proud to learn some lessons from Python :-).
It would be a good thing for some bridges to be built here.)

Tim Peters <tim.one@comcast.net>:
> What we anticipate is that the vast bulk of the time will end up getting
> spent on better tokenization, such as decoding base64 portions, and giving
> special care to header fields and URLs.

This is one of bogofilter's strengths.  It already does this stuff at
the lexical level using a speed-tuned flex scanner (I spent a lot of
the development time watching token strings go by and tweaking the
scanner rules to throw out cruft).  

In fact, look at this.  It's a set of lex rules with interpolated comments:

BASE64		[A-Za-z0-9/+]
IPADDR		[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+
MIME_BOUNDARY	^--[^[:blank:]\n]*$

%%

# Recognize and discard From_ headers
^From\ 						{return(FROM);}

# Recognize and discard headers that contain dates and tumblers.
^Date:.*|Delivery-Date:.*			;
^Message-ID:.*					;

# Throw away BASE64 enclosures.  No point in using this as a discriminator;
# spam that has these in it always has other triggers in the headers.
# Therefore don't pay the overhead to decode it.
^{BASE64}+$					;

# Throw away tumblers generated by MTAs
^\tid\ .*					;
SMTP\ id\ .*					;

# Toss various meaningless MIME cruft
boundary=.*					;
name=\"						;
filename=\"					;
{MIME_BOUNDARY}					;

# Keep IP addresses
{IPADDR}					{return(TOKEN);}

# Keep wordlike tokens of length at least three
[a-z$][a-z0-9$'.-]+[a-z0-9$]			{return(TOKEN);}

# Everything else, including all one-and-two-character tokens,
# gets tossed.
.						;

This small set of rules does a remarkably effective job of tossing
out everything that isn't a potential recognition feature.  Watching
the filtered token stream from a large spam corpus go by is actually
quite an enlightening experience.

It does a better job that Paul's original in one important respect; IP
addresses and hostnames are preserved whole for use as recognition
features.  I think I know why Paul didn't do this -- he's not a C
coder, and if lex/flex isn't part of one's toolkit one's brain doesn't
tend to wander down design paths that involve elaborate lexical
analysis, because it's just too hard.

This is actually the first new program I've coded in C (rather than
Python) in a good four years or so.  There was a reason for this;
I have painful experience with doing lexical analysis in Python that
tells me flex-generated C will be a major performance win here.  The
combination of flex and Judy made it a no-brainer.

>                                        I also *suspect* (based on a
> previous life in speech recogniation) that experiments will show that a
> mixture of character n-grams and word bigrams is significantly more
> effective than a "naive" tokenizer that just looks for US ASCII alphanumeric
> runs.

I share part of your suspicions -- I'm thinking about going to bigram
analysis for header lines.  But I'm working on getting the framework
up first.  Feature extraction is orthogonal to both the Bayesian
analysis and (mostly) to the data-storage method, and can be a drop-in
change if the framework is done right.

> In fact, I'm going to say "Nigerian" and "penis enlargement" one more time
> each here, just to demonstrate that *this* message won't be a false positive
> when the smoke settles <wink>.  Human Growth Hormone too, while I'm at it.

I think I know why, too.  It's the top-15 selection -- the highest-variance
words don't blur into non-spam English the way statistics on *all* tokens
would.  It's like an edge filter.
-- 
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>