[Spambayes] introduction + date filtering for hammie

Fri Nov 15 22:17:28 2002

Hi all,

I'm new to this list.  I played with content-based spam-filtering a few
years ago in perl, and after coming across Gary Robinson's article (and
Graham's) was excited enough to implement both of these approaches in
python.

I was interested in using an approach with multiple metrics, which would
include Bayesian calculations as well as other ad-hoc measurements (e.g.
the percentage of sentences ending with exclamation marks).  
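
As an illustration of such an ad-hoc metric, here is a minimal sketch of the exclamation-mark measurement. The function name and the naive sentence splitting are assumptions made for this example, not code from the original experiment; abbreviations such as "e.g." will be miscounted as sentence ends.

```python
import re

def exclamation_ratio(text):
    """Return the fraction of sentences whose end punctuation includes '!'."""
    # Naive sentence detection: each run of terminal punctuation
    # ('.', '!' or '?') is treated as the end of one sentence.
    terminators = re.findall(r"[.!?]+", text)
    if not terminators:
        return 0.0
    # A sentence counts as exclamatory if its terminator run contains '!',
    # so "Really?!" and "Wow!!!" each count once.
    exclaimed = sum(1 for t in terminators if "!" in t)
    return exclaimed / len(terminators)
```

A value like this could be passed alongside the Bayesian score as one more input to whatever combines the metrics.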

I took these inputs and fed them into a back-propagation neural network
(BPNN) using a python module I found on the web.  My hope was that the
neural network would find the optimum "weights" to use for combining the
multiple inputs into a single output, and would also determine the
optimum cutoff-point between ham/spam, so that no "tweaking" would be
required.  

My initial tests (training on 100-500 emails) showed the neural network
approach (using Robinson as one of the metrics) was somewhat better than
either the Graham or the Robinson approach without the BPNN.  However,
when I started training on larger corpora (I've been collecting spam since
1998), its accuracy degraded.  I did some more reading on the
limitations of BPNNs (namely overtraining), and this result made sense.

So now I've ended up here.  :)

I'm still getting up-to-speed on the spambayes code.  So far, I have one
improvement to offer:

Since your documentation stresses the importance of training using only
relatively recent emails, I thought a good way to do this would be to
have hammie filter out old messages for me.  So I added a new
configuration option: 

[Hammie] 
# When training, hammie will ignore messages older than this number of
# days, e.g. set to 365 to ignore messages older than one year.
# Set to 0 to disable any filtering by date.
ignore_old_messages: 0 

I also modified Hammie to output the number of messages it read/ignored
for each mail file it processes. 
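
For readers who want a feel for the logic, here is a rough sketch of a date check along these lines. It is not the patch itself: the function names are hypothetical, it uses the modern Python standard library rather than the spambayes code, and keeping messages with a missing or unparseable Date header is an assumption of this example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def is_too_old(msg, max_age_days, now=None):
    """Return True if msg's Date header is older than max_age_days."""
    if max_age_days <= 0:
        return False  # 0 disables filtering by date
    header = msg.get("Date")
    if header is None:
        return False  # assumption: keep messages without a Date header
    try:
        sent = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        return False  # assumption: keep messages with unparseable dates
    if sent.tzinfo is None:
        # Dates without usable zone information are treated as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - sent > timedelta(days=max_age_days)

def count_by_age(messages, max_age_days, now=None):
    """Return (read, ignored) counts for an iterable of messages."""
    read = ignored = 0
    for msg in messages:
        if is_too_old(msg, max_age_days, now):
            ignored += 1
        else:
            read += 1
    return read, ignored
```

The messages can be any objects with a dict-style get() for headers, for example the messages yielded by iterating over mailbox.mbox(path) from the standard library.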

This option might also prove useful for doing incremental training (e.g.
set up cron to train once a week, and set ignore_old_messages to 7).
Caveat: this won't catch spams whose dates are deliberately set in the
past, such as January 1, 1970 (I've seen a few).

I've uploaded the patch to the sourceforge project page; hopefully
someone has time to take a look at it.

-- 
Jason D. Hildebrand
jason@peaceworks.ca