[Spambayes] A proposal for mail filtering

Paul Wagland spambayes at kungfoocoder.org
Wed Dec 3 17:16:29 EST 2003

Hi all,

I currently use the IMAP filter program to do mail filtering, and have
been running it in "learning" mode; that is, I specify the SPAM and HAM
folders and tell it to train on them. My SPAM and HAM training folders
used to correspond to my SPAM folder and my INBOX respectively. The
problem with this is that, fortunately, I get much more ham than spam
(please don't "fix" that ;-) ) and so my message counts were getting
wildly out of balance. So I have changed my HAM training folder to be
my "Unsure" folder, giving a pseudo "train on mistakes" mode. The
problem is that this still trains on all of my spam, and so eventually
my SPAM count will end up being too high as well.

My suggestion is to implement some form of mistake-based training.

Here is how I think it could work (please feel free to jump in with
improvements/criticisms/etc :-) ):

In mistakes mode we still "train" on all messages, but we do not add a
message's token scores to either the ham or the spam counts unless the
message is being reclassified. When we detect that a message has been
incorrectly classified, we increase the appropriate ham/spam counts.
To my way of thinking this means that we would need to have five
possible states associated with each message id.

1. Registered as HAM
2. Registered as SPAM
3. Registered as UNSURE
4. Trained as HAM
5. Trained as SPAM

Then the state transitions would be as follows:
[1,3] -> 5 : Add token scores to SPAM count
[2,3] -> 4 : Add token scores to HAM count
4 -> 5      : Add token scores to SPAM count, subtract from HAM
5 -> 4      : Add token scores to HAM count, subtract from SPAM
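
To make the transitions above concrete, here is a rough Python sketch.
Everything in it is made up for illustration (the MistakeTrainer class,
its method names, and the plain token counters); it is not the
SpamBayes classifier API. It also assumes that a correction which
agrees with the registered state (e.g. 2 -> 5) is not a mistake and so
leaves the counts alone, which is my reading of the table rather than
something it spells out.

```python
from collections import Counter
from enum import Enum


class State(Enum):
    REGISTERED_HAM = 1
    REGISTERED_SPAM = 2
    REGISTERED_UNSURE = 3
    TRAINED_HAM = 4
    TRAINED_SPAM = 5


class MistakeTrainer:
    """Hypothetical per-message state tracker (not a SpamBayes class)."""

    def __init__(self):
        self.states = {}       # message id -> State
        self.ham = Counter()   # token -> number of ham messages trained
        self.spam = Counter()  # token -> number of spam messages trained

    def register(self, msg_id, state):
        """Record the filter's verdict; counts are untouched.

        An already-known message keeps its existing state.
        """
        self.states.setdefault(msg_id, state)

    def correct(self, msg_id, tokens, is_spam):
        """Apply a user correction; return True if the counts changed."""
        old = self.states.get(msg_id)
        if is_spam:
            new, agrees, undo = (State.TRAINED_SPAM, State.REGISTERED_SPAM,
                                 State.TRAINED_HAM)
            gain, loss = self.spam, self.ham
        else:
            new, agrees, undo = (State.TRAINED_HAM, State.REGISTERED_HAM,
                                 State.TRAINED_SPAM)
            gain, loss = self.ham, self.spam
        if old is None or old in (new, agrees):
            return False       # unknown message, or nothing to correct
        unique = set(tokens)   # count each token once per message
        if old is undo:
            loss.subtract(unique)  # 4 -> 5 or 5 -> 4: undo earlier training
        gain.update(unique)
        self.states[msg_id] = new
        return True
```

For example, registering a message as REGISTERED_HAM and then calling
correct(msg_id, tokens, is_spam=True) performs the 1 -> 5 transition;
calling it again with is_spam=False performs 5 -> 4 and moves the
tokens back from the spam counts to the ham counts.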

The last two transitions I would not expect to occur all that often,
but people do make mistakes ;-) Since people really do appear to be of
the opinion that it is better to have a balanced message count than an
unbalanced one, maybe we could also automatically train on the last "x"
HAM/SPAM messages (whichever side needs to be "balanced") if the ratio
of one count to the other exceeds 1.5.
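
The balance check could be as simple as the sketch below. The function
name and the returned labels are invented for illustration; only the
1.5 threshold comes from the paragraph above, and how many messages
("x") to then train on is left open.

```python
def side_to_top_up(nham, nspam, max_ratio=1.5):
    """Return which side is under-trained ("ham" or "spam"), or None.

    nham and nspam are the numbers of messages trained as ham and spam.
    """
    if nspam > max_ratio * nham:
        return "ham"    # too much spam trained: add recent ham
    if nham > max_ratio * nspam:
        return "spam"   # too much ham trained: add recent spam
    return None         # counts are within the allowed ratio
```

With 100 ham and 200 spam trained this returns "ham"; with 100 and 150
it returns None, since the ratio is exactly 1.5 and does not exceed it.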

So, what do you think?

Good idea? Or am I just smoking the good shit? :-)

