Graham's spam filter (was Lisp to Python translation criticism?)

Erik Max Francis max at
Wed Aug 21 01:41:00 CEST 2002

David LeBlanc wrote:

> Looking it over, I wonder if some optimizations aren't possible or
> desirable. One that came to mind is to retain url's/urn's as distinct
> tokens.

Yeah, that occurred to me as well.  I wrote the Graham filter code I
posted, did some basic checking to make sure it wasn't obviously wrong,
but haven't put it into practice.  I already have a rule-based filter
(in Python) which is serving me pretty well; building up the corpora to
do the statistical filtering would be somewhat inconvenient at present.

> One thing I don't see how to do is to add a corpus containing a new
> message (good or bad) to the database - i.e. update the database.
> Maybe Database.addGood() and Database.addBad()?

Ah, yeah, good point.  Really the call to the .build method in
Database's constructor was a test driver; in reality you'd keep the good
and bad corpora in attributes and be able to run .build manually.
Then you could just add data to the corpora as needed.  Something like

	class Database:
	    def __init__(self, good, bad):
	        self.good = good
	        self.bad = bad
	    def build(self): # no arguments
	        ngood = self.good.count
	        # everything here changed to self.good and self.bad

Then to add something to the good corpus, for example, you'd just do

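For illustration only, here is a minimal self-contained sketch of that
arrangement.  It is not the code from my earlier post: the Corpus class,
its add method, and the simplified per-token estimate in build are all
hypothetical stand-ins, assumed here just to show the "add to a corpus,
then rebuild by hand" workflow.

```python
from collections import Counter


class Corpus:
    """Hypothetical stand-in: token counts plus a message count."""

    def __init__(self):
        self.counts = Counter()
        self.count = 0  # number of messages added so far

    def add(self, text):
        # Naive whitespace tokenizer, assumed for the sketch only.
        self.counts.update(text.lower().split())
        self.count += 1


class Database:
    def __init__(self, good, bad):
        self.good = good
        self.bad = bad
        self.probs = {}

    def build(self):  # no arguments; reads self.good and self.bad
        ngood = max(self.good.count, 1)
        nbad = max(self.bad.count, 1)
        self.probs = {}
        for token in set(self.good.counts) | set(self.bad.counts):
            # Simplified estimate, not Graham's full formula.
            g = min(1.0, self.good.counts[token] / ngood)
            b = min(1.0, self.bad.counts[token] / nbad)
            self.probs[token] = max(0.01, min(0.99, b / (g + b)))


good, bad = Corpus(), Corpus()
good.add("lunch at noon tomorrow")
bad.add("cheap pills buy now")
db = Database(good, bad)
db.build()
first_buy = db.probs["buy"]      # only seen in spam so far

good.add("buy lunch tomorrow")   # new known-good message
db.build()                       # rebuild manually after the update
second_buy = db.probs["buy"]     # now pulled toward the middle
```

With this shape there is no need for separate addGood/addBad methods on
Database; the corpora own the data and Database just recomputes from
them.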

> With a known good message, I keep getting 0.0000... from the
> Database.scan() and I don't know if that's correct. With a known spam
> file I get 1.0.

Yes, that's right.  If you pick things from the good or bad corpora, the
probabilities will reinforce strongly to make the calculated probability
either very near zero or one.
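
To see why, here is a rough sketch of the combining step as described in
Graham's essay, assuming per-token probabilities clamped to 0.01 and
0.99 and fifteen tokens per message.  The function name combine is mine,
and this is a standalone illustration rather than the scan method
itself.

```python
from math import prod  # Python 3.8+


def combine(probs):
    """Combine per-token spam probabilities into one message score."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Every token was only ever seen in the good corpus.
hammy = combine([0.01] * 15)

# Every token was only ever seen in the bad corpus.
spammy = combine([0.99] * 15)

# Opposing evidence cancels; a neutral token changes nothing.
mixed = combine([0.01, 0.99, 0.5])

print(hammy, spammy, mixed)
```

A message taken straight from a training corpus consists almost
entirely of tokens at one extreme, so the products swamp each other and
the result is indistinguishable from 0 or 1 when printed.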

