Graham's spam filter (was Lisp to Python translation criticism?)
cbbrowne at acm.org
Tue Aug 20 20:14:32 EDT 2002
In the last exciting episode, "David LeBlanc" <whisper at oz.net> wrote:
> Looking it over, I wonder if some optimizations aren't possible or
> desirable. One that came to mind is to retain url's/urn's as distinct
I'd suggest associating tokens with the message header they came
from, so that you might get, out of:
Subject: Re: Graham's spam filter (was Lisp to Python translation criticism?)
the set of tokens:
Then do something similar with .signature material:
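A minimal sketch of what header-tagged and signature-tagged tokens
could look like. The "name:word" prefix format, the "sig" tag, the
token regex, and the function names are all assumptions for
illustration, not anything prescribed above; the signature is found
via the conventional "-- " delimiter line.

```python
import re

# Assumed token shape: runs of letters, digits, apostrophes, dashes, dollar signs.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tagged_tokens(text, tag):
    """Return the set of lowercased tokens in `text`, each prefixed with `tag`."""
    return {f"{tag}:{tok.lower()}" for tok in TOKEN_RE.findall(text)}


def header_tokens(header_line):
    """Split 'Name: value' and tag each word of the value with the header name."""
    name, _, value = header_line.partition(":")
    return tagged_tokens(value, name.strip().lower())


def signature_tokens(body):
    """Tag words after the conventional '-- ' delimiter line with 'sig'."""
    _, sep, sig = body.partition("\n-- \n")
    return tagged_tokens(sig, "sig") if sep else set()
```

With this scheme, "free" in a Subject line and "free" in the body are
counted as different tokens, so each can carry its own probability.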
>> One obvious and immediate issue is that for an industrial-strength
>> filter, the database gets _huge_ (Graham's basic setup involved
>> 4000 messages each in the spam and nonspam corpora), and reading
>> and writing the database (even with cPickle) each time a spam
>> message comes through starts to become intensive.
> I am going to build a version to use Metakit. Should be good for up
> to about 10Mb of messages if I read the Metakit site right.
> One thing I don't see how to do is to add a corpus containing a new
> message (good or bad) to the database - i.e. update the
> database. Maybe Database.addGood() and Database.addBad()?
It works a whopping lot better if there's a whopping lot more than
just two categories...
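As a rough sketch of how incremental updates and more than two
categories might fit together: keep per-category token counts in an
on-disk table and bump them one message at a time, rather than
re-pickling the whole database. SQLite (the standard sqlite3 module)
is swapped in here purely for illustration; it is not Metakit, and
the class and method names are hypothetical.

```python
import sqlite3


class TokenDatabase:
    """Per-category token counts in SQLite, updated one message at a time."""

    def __init__(self, path=":memory:"):
        # Pass a filename instead of ":memory:" to keep counts on disk.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS counts ("
            " category TEXT, token TEXT, n INTEGER,"
            " PRIMARY KEY (category, token))"
        )

    def add(self, category, tokens):
        """Record one message's tokens under `category` (any label, not just two)."""
        with self.conn:  # commits on success, rolls back on error
            for tok in tokens:
                cur = self.conn.execute(
                    "UPDATE counts SET n = n + 1 WHERE category = ? AND token = ?",
                    (category, tok),
                )
                if cur.rowcount == 0:  # first sighting of this token in this category
                    self.conn.execute(
                        "INSERT INTO counts (category, token, n) VALUES (?, ?, 1)",
                        (category, tok),
                    )

    def count(self, category, token):
        """Return how often `token` has been seen in `category` (0 if never)."""
        row = self.conn.execute(
            "SELECT n FROM counts WHERE category = ? AND token = ?",
            (category, token),
        ).fetchone()
        return row[0] if row else 0

    def categories(self):
        """Return the sorted list of categories seen so far."""
        rows = self.conn.execute(
            "SELECT DISTINCT category FROM counts ORDER BY category"
        )
        return [r[0] for r in rows]
```

Under these assumptions, addGood() and addBad() reduce to
db.add("good", tokens) and db.add("bad", tokens), and a third or
tenth category is just another label.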
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
Trivialize a user's bug report by pointing out that it was fixed
independently long ago in a system that hasn't been released yet.
-- from the Symbolics Guidelines for Sending Mail