Graham's spam filter (was Lisp to Python translation criticism?)
Erik Max Francis
max at alcyone.com
Wed Aug 21 01:41:00 CEST 2002
David LeBlanc wrote:
> Looking it over, I wonder if some optimizations aren't possible or
> desirable. One that came to mind is to retain url's/urn's as distinct
Yeah, that occurred to me as well. I wrote the Graham filter code I
posted, did some basic checking to make sure it wasn't obviously wrong,
but haven't put it into practice. I already have a rule-based filter
(in Python) which is serving me pretty well; building up the corpora to
do the statistical filtering would be somewhat inconvenient at present.
> One thing I don't see how to do is to add a corpus containing a new
> (good or bad) to the database - i.e. update the database. Maybe
> Database.addGood() and Database.addBad()?
Ah, yeah, good point. Really the call to the .build method in
Database's constructor was a test driver; in reality you'd keep the good
and bad databases in attributes and be able to run .build manually.
Then you could just add data to the corpora as needed. Something like
    class Database:
        def __init__(self, good, bad):
            self.good = good
            self.bad = bad

        def build(self): # no arguments
            ngood = self.good.count
            # everything here changed to self.good and self.bad
Then to add something to the good corpus, for example, you'd just do
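A minimal sketch of that workflow, assuming a hypothetical Corpus class
with an add method (only the count attribute and the Database names
appear in the code above). The token weighting here is simplified from
Graham's essay and is not the posted filter:

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z0-9$'-]+")


class Corpus:
    """Hypothetical stand-in; only the .count attribute comes from the post."""

    def __init__(self):
        self.count = 0            # number of messages added so far
        self.tokens = Counter()   # token -> number of occurrences

    def add(self, text):          # assumed method name
        self.count += 1
        self.tokens.update(TOKEN.findall(text.lower()))


class Database:
    def __init__(self, good, bad):
        self.good = good
        self.bad = bad
        self.probs = {}

    def build(self):  # no arguments
        ngood = max(self.good.count, 1)
        nbad = max(self.bad.count, 1)
        self.probs = {}
        for token in set(self.good.tokens) | set(self.bad.tokens):
            # Simplified weighting: good counts doubled, result clamped
            # to [0.01, 0.99]; the rare-token cutoff is left out.
            g = min(1.0, 2 * self.good.tokens[token] / ngood)
            b = min(1.0, self.bad.tokens[token] / nbad)
            self.probs[token] = max(0.01, min(0.99, b / (g + b)))


good, bad = Corpus(), Corpus()
good.add("lunch meeting moved to noon")
bad.add("buy cheap pills now")

db = Database(good, bad)
db.build()
before = db.probs["cheap"]

# Add a new known-good message, then rebuild by hand.
db.good.add("cheap lunch downtown today")
db.build()
after = db.probs["cheap"]
```

Because build takes no arguments and reads self.good and self.bad, the
update is just "add to the corpus, call build again"; separate addGood
and addBad methods would only be thin wrappers around those two steps.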
> With a known good message, I keep getting 0.0000... from the
> and I don't know if that's correct. With a known spam file I get 1.0.
Yes, that's right. If you pick things from the good or bad corpora, the
probabilities will reinforce strongly to make the calculated probability
either very near zero or one.
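A small sketch of why this happens, assuming the combining rule from
Graham's "A Plan for Spam" essay, where per-token probabilities are
clamped to the range 0.01 to 0.99 and the most interesting tokens are
combined. The function name and the sample lists are illustrative, not
taken from the posted code:

```python
from math import prod


def combined(probs):
    """Combine per-token spam probabilities the way Graham's essay does."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# A message drawn from the good corpus: every token is pinned at 0.01.
all_good = combined([0.01] * 15)

# A message drawn from the bad corpus: every token is pinned at 0.99.
all_bad = combined([0.99] * 15)

# Mixed evidence: the two extreme tokens cancel, leaving the third.
mixed = combined([0.99, 0.01, 0.6])

print(all_good, all_bad, mixed)
```

With fifteen tokens all clamped at one extreme, one of the two products
is on the order of 1e-30, so the result is indistinguishable from 0 or 1
when printed to a few decimal places. Messages that were not part of
either corpus should give less extreme values.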
Erik Max Francis / max at alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/ \ There is nothing so subject to the inconstancy of fortune as war.
\__/ Miguel de Cervantes
Church / http://www.alcyone.com/pyos/church/
A lambda calculus explorer in Python.