Graham's spam filter
Joseph A. Knapka
jknapka at earthlink.net
Thu Aug 22 18:33:13 CEST 2002
Oren Tirosh wrote:
> On Wed, Aug 21, 2002 at 10:48:59PM -0700, Erik Max Francis wrote:
> > > What this program currently tries to implement is a client/server-based
> > > protocol with authentication that allows some program to contact
> > > the server for classifying text that is passed in, working around the
> > > limitation that was discussed on the mailing-list that it is quite bad
> > > for response time to always have to reload the database on scanning.
> > I don't think that this is necessarily true; certainly and without a doubt,
> > reloading the _entire_ database each time is a non-starter. The
> > possibility of using a gdbm or similar database system might shorten
> > those times to very reasonable amounts, but this is something I haven't
> > researched yet.
> Reloading the entire database is not necessarily a non-starter. If the
> database is represented as some kind of hash table in a linear memory block
> without using any pointers it can be mmapped. The page cache will take
> care of the rest. I think this is easier to implement and manage than a
> client-server solution. I won't be surprised if it's faster, too.
Hmm. My version of this has two programs:
- an analyzer that starts up once per day or so and reads the
  corpus, writing the token-->spam probabilities to a plain old
  file as a dictionary.
- the filter, which just opens that file and eval()s the contents.
The analyzer takes about two minutes to go through my corpus of
about 2000 messages. The filter starts and loads the probability
dictionary in under five seconds. Doesn't seem like a non-starter
to me :-) (Of course, the user should never have to deal with
either program, except to configure it. The filter reads from
a POP3 or IMAP mailbox and writes the spam-free messages
either to a file or to another "sanitized" SMTP mailbox,
which is the one the user checks.)
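
A rough sketch of the analyzer/filter split described above, in case it
helps. This is not the actual code: the function names, the whitespace
tokenizer, and the probability estimate (the share of a token's
occurrences that were in spam) are all placeholders, not Graham's
formula. The sketch also loads the file with ast.literal_eval() rather
than a bare eval(), since literal_eval() only accepts Python literals
and so will not run arbitrary code if the file gets tampered with.

```python
import ast
from collections import Counter


def analyze(spam_msgs, ham_msgs, path):
    """Analyzer: write a token -> spam probability dict to `path`.

    Hypothetical helper. The estimate below is a stand-in, not the
    weighting from Graham's article.
    """
    # Naive tokenizer (assumption): lowercase, split on whitespace.
    spam = Counter(tok for msg in spam_msgs for tok in msg.lower().split())
    ham = Counter(tok for msg in ham_msgs for tok in msg.lower().split())

    probs = {}
    for tok in set(spam) | set(ham):
        # Fraction of this token's occurrences that appeared in spam.
        probs[tok] = spam[tok] / (spam[tok] + ham[tok])

    # repr() of a dict of str -> float is a valid Python dict literal.
    with open(path, "w", encoding="utf-8") as f:
        f.write(repr(probs))
    return probs


def load_probabilities(path):
    """Filter side: read the dict literal back in a single step."""
    with open(path, encoding="utf-8") as f:
        return ast.literal_eval(f.read())
```

The analyzer would run from cron or similar, and the filter would call
load_probabilities() once at startup, so the filter itself never has to
touch the corpus.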
"I'd rather chew my leg off than maintain Java code, which
sucks, 'cause I have a lot of Java code to maintain and
the leg surgery is starting to get expensive." - Me