Graham's spam filter (was Lisp to Python translation criticism?)

Sat Aug 17 18:10:12 EDT 2002

"Paul Rubin" wrote:

> Erik Max Francis <max at alcyone.com> writes:
> > One obvious and immediate issue is that for an industrial-strength
> > filter, the database gets _huge_ (Graham's basic setup involved 4000
> > messages each in the spam and nonspam corpora), and reading and writing
> > the database (even with cPickle) each time a spam message comes through
> > starts to become intensive.
>
> Why not use dbhash?  I think there's also a Python cdb wrapper somewhere.

Assuming you mean Dan Bernstein's cdb, its homepage at
http://cr.yp.to/cdb.html links to a Python wrapper.
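
For what it's worth, here is a rough sketch of the on-disk idea: keep the
token probabilities in a key/value file and look up one token at a time,
instead of unpickling everything. It uses the standard library's generic dbm
module as a stand-in (my assumption; dbhash itself is a Python 2 module, and
I am not showing the cdb wrapper's API here). The names store_probs and
lookup_prob, and the 0.4 default for unseen tokens, are made up for
illustration.

```python
import dbm


def store_probs(path, probs):
    """Write a {token: probability} dict to an on-disk dbm file."""
    with dbm.open(path, "c") as db:  # "c": create the file if missing
        for token, p in probs.items():
            # dbm stores bytes, so encode both the key and the value.
            db[token.encode("utf-8")] = repr(p).encode("ascii")


def lookup_prob(path, token, default=0.4):
    """Fetch one token's probability without loading the whole table."""
    with dbm.open(path, "r") as db:
        key = token.encode("utf-8")
        if key in db:
            return float(db[key].decode("ascii"))
        return default  # hypothetical value for tokens never seen
```

In practice you would open the file once per message rather than once per
token; the per-call open here just keeps the sketch short.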

But I don't think that a pickled dictionary/database would be unmanageably
huge, even with a large set of input messages, since the growth of the
"vocabulary" (i.e., the set of distinct tokens) slows as more messages are
added.  The spam probability database in particular is smaller than the
"good" and "bad" ones, since it only includes tokens that pass a frequency
threshold.

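To illustrate why the probability database ends up smaller, here is a
minimal sketch of deriving it from the good and bad token counts. It is
loosely modeled on my reading of Graham's description; the function name, the
threshold of 5, the doubling of good counts, and the 0.01/0.99 clamping
are assumptions for this sketch, not a definitive implementation.

```python
def build_spam_probs(good, bad, ngood, nbad, min_count=5):
    """Derive {token: spam probability} from per-corpus token counts.

    good, bad   -- dicts mapping token -> occurrence count in each corpus
    ngood, nbad -- number of messages in each corpus (must be > 0)
    min_count   -- assumed frequency threshold; rarer tokens are dropped
    """
    probs = {}
    for token in set(good) | set(bad):
        g = 2 * good.get(token, 0)  # good counts doubled to bias against false positives
        b = bad.get(token, 0)
        if g + b < min_count:
            continue  # too rare to say anything: this is what shrinks the table
        g_ratio = min(1.0, g / ngood)
        b_ratio = min(1.0, b / nbad)
        p = b_ratio / (g_ratio + b_ratio)
        probs[token] = max(0.01, min(0.99, p))  # never fully certain
    return probs
```

Any token that shows up only once or twice across both corpora never makes
it into the result, so the table can only be as large as the combined
vocabulary and is usually a good deal smaller.
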
The databases don't necessarily have to be updated every time an email comes
in.  Only the (smaller) spam probability database needs to be read to
determine whether an email is spam.  Updating the good and bad databases can
probably be left to a daily cron job.
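
A rough sketch of the read-only side, to show that scoring a message touches
nothing but the pickled probability table. The tokenizer, the 0.4 value for
unknown tokens, the choice of the 15 most extreme tokens, and the function
names are all assumptions for illustration; the table is assumed to hold
values strictly between 0 and 1.

```python
import pickle
import re


def load_probs(path):
    """Load the pickled {token: probability} table (read-only use)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def tokenize(text):
    """Split text into a set of lowercase tokens (assumed token syntax)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def spam_score(text, probs, unknown=0.4, n_interesting=15):
    """Combine the most extreme token probabilities into one score."""
    ps = [probs.get(tok, unknown) for tok in tokenize(text)]
    # Keep the tokens whose probability is farthest from a neutral 0.5.
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod = 1.0
    inv = 1.0
    for p in ps[:n_interesting]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)  # 0.5 when the message has no tokens
```

The mail-time path is then just load_probs() plus spam_score() with some
cutoff (say 0.9), while the job that rewrites the pickle from the good and
bad counts can run from cron whenever convenient.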