Graham's spam filter

Thu Aug 22 01:48:59 EDT 2002

Heiko Wundram wrote:

> I've started implementing a little program using Graham's spam filter
> for filtering mail.
> 
> What this program momentarily tries to implement is a client/server
> based protocol with authentication that allows some program to contact
> the server for classifying text that is passed in, working around the
> limitation that was discussed on the mailing-list that it is quite bad
> for response time to always have to reload the database on scanning.

I don't think that this is necessarily true; certainly and without a doubt,
reloading the _entire_ database each time is a non-starter.  The
possibility of using a gdbm or similar database system might shorten
those times to very reasonable amounts, but this is something I haven't
researched yet.
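
To make the idea concrete, here is a rough sketch of a keyed lookup that
fetches counts only for the tokens in one message instead of loading
everything.  It uses Python's generic dbm module (which picks gdbm or
another backend if available); the function names and the "good bad"
value format are my own assumptions for illustration, not part of
Graham's description, and I haven't measured how fast it is.

```python
import dbm

def store_counts(path, counts):
    """Write {token: (good, bad)} into an on-disk key/value database."""
    with dbm.open(path, "c") as db:
        for token, (good, bad) in counts.items():
            # dbm stores bytes; the "good bad" text format is an assumption.
            db[token.encode("utf-8")] = f"{good} {bad}".encode("ascii")

def lookup_counts(path, tokens):
    """Fetch counts for just the given tokens; unseen tokens are omitted."""
    found = {}
    with dbm.open(path, "r") as db:
        for token in set(tokens):
            raw = db.get(token.encode("utf-8"))
            if raw is not None:
                good, bad = raw.decode("ascii").split()
                found[token] = (int(good), int(bad))
    return found
```

Each scan then costs one lookup per distinct token in the message rather
than a read of the whole table.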

As I said earlier, one blocking issue for me in actually putting the
filter into practice is the lack of good corpora (one for spam, one for
non-spam); I keep all mail I receive, but the "backups" that I have
usually consist of all the email I've ever received.  (I certainly have
kept a lot of good mail, but of course I've deleted a lot more, so it's
hard to know whether or not it would be useful.)  Note that if, from now
on, I did manage to keep a corpus of all good email I've received
alongside all email (both good and bad), it would be easy to apply
simple subtraction to determine the good and bad figures (which are
needed by Graham's algorithm), but what I have now consists of only some
good messages going back through time and all email I've ever received
(good and bad) since I switched over to my new rule-based Python filter.
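
The subtraction I have in mind would look roughly like the sketch below.
It assumes the good corpus is a strict subset of the "all mail" corpus
over the same period, which is exactly what I don't have yet; the
whitespace tokenizer and the function names are placeholders.

```python
from collections import Counter

def count_tokens(messages):
    """Count token occurrences across a list of message strings."""
    counts = Counter()
    for text in messages:
        # Placeholder tokenizer: lowercase and split on whitespace.
        counts.update(text.lower().split())
    return counts

def bad_counts(all_counts, good_counts):
    """Derive bad-mail token counts as (all mail) minus (good mail).

    Counter subtraction drops tokens whose result is zero or negative,
    so tokens seen only in good mail disappear from the result.
    """
    return all_counts - good_counts

def bad_message_total(n_all, n_good):
    """The number of bad messages falls out of the same subtraction."""
    return n_all - n_good
```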

I think in my second pass through the idea I'll attempt to orchestrate a
database lookup that's more efficient and hopefully doesn't need to go
the full client/server route (since that opens up more avenues for
failure).  I think I'll employ a combination of ideas that have been
presented here -- such as distinguishing keywords by their place in the
file (e.g., if the word "spam" appears in the Subject header it would be
distinguished as subject/spam for greater scrutiny), as well as things like
treating full email addresses and URLs as one single keyword instead of
letting the tokenizer chop them up into unrecognizable forms.
Additionally, I'll probably apply a simple "keywordizing" algorithm that
strips off grammatical endings in order to try to get at what the word
is (e.g., remove endings such as -s, -er, -ed, -ing, and -'s; this may
make the keywords "unreadable" but of course they're going to be
analyzed by the algorithm anyway, not by a human).
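
A first stab at such a tokenizer might look like the following.  This is
only a sketch: the regular expressions are deliberately loose, the
suffix list and the minimum stem length of three letters are guesses,
bare numbers are ignored, and the names (tokenize, strip_ending, and so
on) are made up for the example.

```python
import re

# URLs and email addresses are tried first so they survive as one token.
TOKEN_RE = re.compile(
    r'https?://[^\s<>"]+'               # URL
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"    # email address
    r"|[A-Za-z][A-Za-z'-]*"             # ordinary word
)

# Checked in order; at most one ending is removed per word.
SUFFIXES = ("'s", "ing", "ed", "er", "s")

def strip_ending(word):
    """Crudely remove one grammatical ending, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text, prefix=""):
    """Split text into keywords, optionally tagging each with a prefix."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group(0).lower()
        if "@" not in token and "://" not in token:
            # Plain words get trimmed; addresses and URLs are left whole.
            token = strip_ending(token.strip("'-"))
        if token:
            tokens.append(prefix + token)
    return tokens

def tokenize_message(subject, body):
    """Tag Subject keywords as subject/<word>; body keywords stay bare."""
    return tokenize(subject, "subject/") + tokenize(body)
```

With this, a Subject of "Free spam offers" yields subject/free,
subject/spam, and subject/offer, while an address like max@alcyone.com
in the body stays a single keyword.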

I'll let people know what I come up with.

-- 
 Erik Max Francis / max at alcyone.com / http://www.alcyone.com/max/
 __ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/  \ There is nothing so subject to the inconstancy of fortune as war.
\__/ Miguel de Cervantes
    Church / http://www.alcyone.com/pyos/church/
 A lambda calculus explorer in Python.