
(Copied to Paul Graham. Paul, this is the mailing list of the Python maintainers. I thought you'd find the bits about lexical analysis in bogofilter interesting. Pythonistas, Paul is one of the smartest and sanest people in the LISP community, as evidenced partly by the fact that he hasn't been too proud to learn some lessons from Python :-). It would be a good thing for some bridges to be built here.)

Tim Peters <tim.one@comcast.net>:
> What we anticipate is that the vast bulk of the time will end up getting spent on better tokenization, such as decoding base64 portions, and giving special care to header fields and URLs.
This is one of bogofilter's strengths. It already does this stuff at the lexical level using a speed-tuned flex scanner (I spent a lot of the development time watching token strings go by and tweaking the scanner rules to throw out cruft). In fact, look at this. It's a set of lex rules with interpolated comments:

BASE64		[A-Za-z0-9/+]
IPADDR		[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+
MIME_BOUNDARY	^--[^[:blank:]\n]*$

%%

# Recognize and discard From_ headers
^From\ 				{return(FROM);}

# Recognize and discard headers that contain dates and tumblers.
^Date:.*|Delivery-Date:.*	;
^Message-ID:.*			;

# Throw away BASE64 enclosures.  No point in using this as a discriminator;
# spam that has these in it always has other triggers in the headers.
# Therefore don't pay the overhead to decode it.
^{BASE64}+$			;

# Throw away tumblers generated by MTAs
^\tid\ .*			;
SMTP\ id\ .*			;

# Toss various meaningless MIME cruft
boundary=.*			;
name=\"				;
filename=\"			;
{MIME_BOUNDARY}			;

# Keep IP addresses
{IPADDR}			{return(TOKEN);}

# Keep wordlike tokens of length at least three
[a-z$][a-z0-9$'.-]+[a-z0-9$]	{return(TOKEN);}

# Everything else, including all one-and-two-character tokens,
# gets tossed.
.				;

This small set of rules does a remarkably effective job of tossing out everything that isn't a potential recognition feature. Watching the filtered token stream from a large spam corpus go by is actually quite an enlightening experience. It does a better job than Paul's original in one important respect: IP addresses and hostnames are preserved whole for use as recognition features.

I think I know why Paul didn't do this -- he's not a C coder, and if lex/flex isn't part of one's toolkit, one's brain doesn't tend to wander down design paths that involve elaborate lexical analysis, because it's just too hard. This is actually the first new program I've coded in C (rather than Python) in a good four years or so. There was a reason for this; painful experience with doing lexical analysis in Python tells me flex-generated C will be a major performance win here. The combination of flex and Judy made it a no-brainer.
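For the Pythonistas following along, here is a rough Python sketch of what those rules boil down to. This is my own illustration rather than anything in bogofilter (which runs the flex-generated C), it skips several of the discard rules, and the tokenize() name is arbitrary:

import re

# Rough approximation of the idea, not a complete translation of the flex
# rules: drop date/tumbler header lines, keep IP addresses whole, keep
# wordlike tokens of length at least three.
HEADER_JUNK = re.compile(r"^(From |Date:|Delivery-Date:|Message-ID:|\tid )")
IPADDR = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")
WORD = re.compile(r"[a-z$][a-z0-9$'.-]+[a-z0-9$]", re.IGNORECASE)

def tokenize(text):
    "Yield candidate recognition features from raw message text."
    for line in text.splitlines():
        if HEADER_JUNK.match(line):
            continue                     # From_ lines, dates, tumblers
        for ip in IPADDR.findall(line):
            yield ip                     # IP addresses, kept whole
        for word in WORD.findall(line):
            yield word                   # wordlike tokens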
> I also *suspect* (based on a previous life in speech recognition) that experiments will show that a mixture of character n-grams and word bigrams is significantly more effective than a "naive" tokenizer that just looks for US ASCII alphanumeric runs.
I share part of your suspicions -- I'm thinking about going to bigram analysis for header lines. But I'm working on getting the framework up first. Feature extraction is orthogonal both to the Bayesian analysis and (mostly) to the data-storage method, and can be a drop-in change if the framework is done right.
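To make the header-bigram idea concrete, here is the kind of drop-in extractor I have in mind. This is a sketch only; the function name and the feature-string format are illustrative, not anything bogofilter does today:

def header_features(name, value):
    "Emit single-word and word-pair features for one header line."
    words = value.lower().split()
    prefix = name.lower()
    for w in words:
        yield "%s:%s" % (prefix, w)        # one feature per word
    for a, b in zip(words, words[1:]):
        yield "%s:%s %s" % (prefix, a, b)  # one feature per adjacent pair

Because the extractor just yields feature strings, swapping something like this in touches neither the Bayesian math nor the data-storage method, which is exactly why I want the framework's interfaces settled first.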
In fact, I'm going to say "Nigerian" and "penis enlargement" one more time each here, just to demonstrate that *this* message won't be a false positive when the smoke settles <wink>. Human Growth Hormone too, while I'm at it.
I think I know why, too. It's the top-15 selection -- the highest-variance words don't blur into non-spam English the way statistics on *all* tokens would. It's like an edge filter.
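For anyone who wants to see what that selection amounts to in code, here is a sketch (my illustration, not Paul's or bogofilter's actual scoring), assuming a spam_prob() lookup from the trained word database: keep only the tokens whose probabilities sit furthest from the neutral 0.5.

def interesting_tokens(tokens, spam_prob, n=15):
    "Keep the n tokens whose spam probability is furthest from 0.5."
    return sorted(tokens, key=lambda t: abs(spam_prob(t) - 0.5),
                  reverse=True)[:n]

Everything near 0.5 -- the bland, common English -- never gets a vote, which is where the edge-filter effect comes from.
--
		<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>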