Graham's spam filter

Thu Aug 22 17:01:19 EDT 2002

So then, Erik Max Francis <max at alcyone.com> is all like:

> I don't think that this is necessarily true; certainly and without a doubt,
> reloading the _entire_ database each time is a non-starter.  The
> possibility of using a gdbm or similar database system might shorten
> those times to very reasonable amounts, but this is something I
> haven't researched yet.

Well, I've actually written a Bayesian spam filter, and brother let me
tell you, it works and it works fast.  I can't release it yet 'cause my
employer hasn't given me the green light, but it consists of a
procmail-invokable thingy that writes new headers very much like
spamassassin.

I have the announcement all queued up and ready to send as soon as I get
the go-ahead from upstairs.  I've been waiting for two days.  Sigh.  If
only I'd had some GPL code to build it into.

Anyhow, using anydbm instead of a cPickled dictionary turned a
15-seconds-per-message operation into a 0.3-seconds-per-message
operation.  My database is now 5 megs, mostly text.  I'm using a Berkeley
hash database file.  So yes, a database file is absolutely worth it.
Don't even bother using flat files like bogofilter currently does.
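
To illustrate the dbm approach, here is a minimal sketch of a token-count
store.  It is not the unreleased filter described above; the function
names add_counts and lookup are made up for illustration, and it uses
the Python 3 `dbm` module (the renamed successor of anydbm).  The point
is that each lookup touches only the keys it needs instead of
unpickling the whole dictionary for every message.

```python
import dbm


def add_counts(db_path, tokens):
    """Increment the stored count of each token in a dbm file.

    Hypothetical helper: keys are UTF-8 tokens, values are ASCII integers.
    """
    # "c" opens the database read/write and creates it if it is missing.
    with dbm.open(db_path, "c") as db:
        for tok in tokens:
            key = tok.encode("utf-8")
            count = int(db[key]) if key in db else 0
            db[key] = str(count + 1).encode("ascii")


def lookup(db_path, token):
    """Return the stored count for one token (0 if it was never seen)."""
    with dbm.open(db_path, "r") as db:
        key = token.encode("utf-8")
        return int(db[key]) if key in db else 0
```

Which backend you actually get (Berkeley DB, gdbm, ndbm, or a slower
pure-Python fallback) depends on how your Python was built.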

> As I said earlier, one blocking issue for me in actually putting the
> filter into practice is the lack of good corpora (one for spam, one
> for non-spam);

As far as I can tell, you don't need a large body of input messages
before you start to get pretty good results.  The larger the input, the
better the results, of course, but if you spent 30 minutes filing
messages I bet you'd have enough to get you going.  Then, just make sure
you move all the messages you've verified as spam into a special folder,
and run that through the corpus analyzer.

I'm currently working on a system that will pull messages out of a Gnus
nnmail directory or a Berkeley mbox folder.  Then you just drag your
message into a "good" or "bad" folder, and every night the mail fairy
(cron job) will learn what sorts of things you don't want to see
tomorrow.
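
A nightly "corpus analyzer" pass over one mbox folder could look
roughly like the sketch below.  This is an assumption about how such a
job might be written, not the system described above: count_mbox and
tokenize are hypothetical names, and the token pattern (letters,
digits, dashes, apostrophes and dollar signs, with all-digit tokens
dropped) follows my reading of Graham's essay.

```python
import mailbox
import re
from collections import Counter

# Token characters as in Graham's description; everything else separates.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split text into tokens, ignoring tokens made only of digits."""
    return [t for t in TOKEN_RE.findall(text) if not t.isdigit()]


def count_mbox(path):
    """Return (token counts, number of messages) for one mbox file."""
    counts = Counter()
    n_messages = 0
    for msg in mailbox.mbox(path):
        n_messages += 1
        # as_string() gives headers plus body, without the "From " line.
        counts.update(tokenize(msg.as_string()))
    return counts, n_messages
```

You would run it once on the "good" folder and once on the "bad"
folder from cron, and keep both the counts and the message totals,
since the scoring step needs all four.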

> I think I'll employ a combination of ideas that have been presented
> here -- such as distinguishing keywords by their place in the file
> (i.e., if the word "spam" appears in the Subject header it would be
> distinguished as subject/spam for greater scrutiny), as well as things
> like treating full email addresses and URLs as one single keyword
> instead of letting the tokenizer chop them up into unrecognizable
> forms.

But it turns out that it doesn't matter much--at least not at this point
in time.  Just using the method presented by Graham seems to be good
enough to catch nearly every spam I've gotten.
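
For reference, the scoring method from Graham's essay can be sketched
as below.  This is my paraphrase of "A Plan for Spam", not the code
from the filter described above; the function names are made up, and
the constants (good counts doubled, tokens seen fewer than 5 times
ignored, probabilities clamped to 0.01..0.99, 0.4 for unknown tokens,
the 15 most interesting tokens) are the ones I recall from the essay.

```python
def token_prob(good, bad, ngood, nbad):
    """Spam probability for one token, or None if it is too rare.

    good/bad are the token's counts in each corpus; ngood/nbad are the
    number of messages in each corpus (both assumed to be nonzero).
    """
    g = 2 * good  # good counts are doubled to bias against false positives
    b = bad
    if g + b < 5:
        return None
    pg = min(1.0, g / ngood)
    pb = min(1.0, b / nbad)
    return max(0.01, min(0.99, pb / (pg + pb)))


def spam_score(tokens, good_counts, bad_counts, ngood, nbad):
    """Combine the 15 most interesting token probabilities into one score."""
    probs = []
    for tok in set(tokens):
        p = token_prob(good_counts.get(tok, 0), bad_counts.get(tok, 0),
                       ngood, nbad)
        probs.append(0.4 if p is None else p)
    # "Interesting" means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod = 1.0
    inv = 1.0
    for p in probs[:15]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)
```

In the essay a message is treated as spam when the score is above 0.9.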

One thing you *should* do, though, is skip base64-encoded stuff.  That
will just clutter up your database.
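
One way to skip the base64 parts, sketched with the standard `email`
package: walk the MIME tree and drop any leaf whose
Content-Transfer-Encoding is base64.  The function name
text_for_tokenizing is hypothetical, and this is only one possible
approach, not necessarily how the filter above does it.

```python
import email


def text_for_tokenizing(raw_message):
    """Return top-level headers plus every non-base64 body part as text."""
    msg = email.message_from_string(raw_message)
    chunks = [f"{name}: {value}" for name, value in msg.items()]
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload text of their own
        encoding = str(part.get("Content-Transfer-Encoding", ""))
        if encoding.strip().lower() == "base64":
            continue  # encoded blobs would only add junk tokens
        payload = part.get_payload()
        if isinstance(payload, str):
            chunks.append(payload)
    return "\n".join(chunks)
```

The result can be fed straight to whatever tokenizer you use.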

I wish I could give away the code I wrote.  Keep your fingers crossed
that $FIRM will hop to it quickly.

Neale