[Spambayes] introduction + date filtering for hammie
Fri Nov 15 22:17:28 2002
I'm new to this list. I played with content-based spam-filtering a few
years ago in perl, and after coming across Gary Robinson's article (and
Graham's) was excited enough to implement both of these approaches in
I was interested in using an approach with multiple metrics, which would
include Bayesian calculations as well as other ad-hoc measurements (e.g.
the percentage of sentences ending with exclamation marks).
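
To make the exclamation-mark metric concrete, here is a minimal sketch of
how such a measurement could be computed. The function name
exclamation_ratio and the regex-based sentence splitting are assumptions
made for illustration; this is not the code I originally used.

```python
import re

def exclamation_ratio(text):
    """Return the fraction of sentences in text that end with '!'.

    Hypothetical helper: a "sentence" here is simply a run of characters
    terminated by '.', '!' or '?', which is a rough approximation.
    """
    sentences = re.findall(r"[^.!?]+[.!?]+", text)
    if not sentences:
        # No terminated sentences found, so there is nothing to measure.
        return 0.0
    excited = sum(1 for s in sentences if s.rstrip().endswith("!"))
    return excited / len(sentences)
```

A value like this would be one input among several, alongside the
Bayesian score.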
I took these inputs and fed them into a back-propagation neural network
(BPNN) using a python module I found on the web. My hope was that the
neural network would find the optimum "weights" to use for combining the
multiple inputs into a single output, and would also determine the
optimum cutoff-point between ham/spam, so that no "tweaking" would be
My initial tests (training on 100-500 emails) showed the neural network
approach (using Robinson as one of the metrics) was somewhat better than
either the Graham or the Robinson approach without the BPNN. However, when I
started training on larger corpora (I've been collecting spam since
1998), its accuracy degraded. I did some more reading on the
limitations of BPNNs (namely overtraining), and this result made sense.
So now I've ended up here. :)
I'm still getting up-to-speed on the spambayes code. So far, I have one
improvement to offer:
Since your documentation stresses the importance of training using only
relatively recent emails, I thought a good way to do this would be to
have hammie filter out old messages for me. So I added a new
# when training, hammie will ignore messages older than this number of
# days, e.g. set to 365 to ignore messages older than one year.
# Set to 0 to disable any filtering by date.
I also modified Hammie to output the number of messages it read/ignored
for each mail file it processes.
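
The idea can be sketched roughly as follows. This is not the patch itself:
the helper names is_too_old and filter_recent are made up for this example,
and the choice to keep messages with a missing or unparseable Date header
is an assumption. Only the meaning of the option (an age in days, 0 to
disable) comes from the description above.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

def is_too_old(msg, max_age_days, now=None):
    """Return True if msg's Date header is older than max_age_days.

    Hypothetical helper. A max_age_days of 0 disables the check.
    """
    if max_age_days <= 0:
        return False
    header = msg.get("Date")
    if header is None:
        # Assumption: keep messages that carry no Date header.
        return False
    try:
        sent = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        # Assumption: keep messages whose date cannot be parsed.
        return False
    if sent.tzinfo is None:
        # Treat dates without a usable timezone as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return now - sent > timedelta(days=max_age_days)

def filter_recent(messages, max_age_days, now=None):
    """Return (kept_messages, ignored_count) for a list of messages."""
    kept = [m for m in messages if not is_too_old(m, max_age_days, now)]
    return kept, len(messages) - len(kept)
```

The ignored count returned by filter_recent corresponds to the
read/ignored numbers mentioned above.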
This option might also prove useful for doing incremental training (e.g.
set up cron to train once a week, and set ignore_old_messages to 7).
Caveat: this won't catch spams whose dates are deliberately set in the
past, such as January 1, 1970 (I've seen a few).
I've uploaded the patch to the sourceforge project page; hopefully
someone has time to take a look at it.
Jason D. Hildebrand