[Spambayes] A proposal for mail filtering
spambayes at kungfoocoder.org
Wed Dec 3 17:16:29 EST 2003
I currently use the IMAP filter program to do mail filtering, and have
been running it in "learning" mode; that is, I specify the SPAM and HAM
folders and tell it to learn from them. My SPAM and HAM training folders
used to correspond to my SPAM folder and my INBOX respectively. The
problem with this is that, fortunately, I get much more ham than spam
(please don't "fix" that ;-) ) and so my message counts were getting
wildly out of balance. So I have changed my HAM training folder
to be my "Unsure" folder, giving a pseudo "train on mistakes" mode. The
problem is that this still trains on all of my spam, and so
eventually my SPAM count will end up being too high as well.
My suggestion is to implement some form of mistake-based training,
along the following lines (please feel free to jump in with
improvements/criticisms/etc :-) ):
In mistakes mode we still "train" on (that is, register) all messages,
but we do not add their token scores to either the ham or the spam
count unless the message is being re-classified. When we detect that a
message has been incorrectly classified, we increase the appropriate
ham/spam score. To my way of thinking this means that we would need
five states associated with each message id:
1. Registered as HAM
2. Registered as SPAM
3. Registered as UNSURE
4. Trained as HAM
5. Trained as SPAM
Then the state transitions would be as follows:
[1,3] -> 5 : Add token scores to SPAM count
[2,3] -> 4 : Add token scores to HAM count
4 -> 5 : Add token scores to SPAM count, subtract from HAM
5 -> 4 : Add token scores to HAM count, subtract from SPAM
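
To make the five states and four transitions above concrete, here is a
rough Python sketch. It is only an illustration under assumptions: the
names (State, MistakeTrainer, register, reclassify) are hypothetical and
are not the Spambayes API, and the "token scores" are modelled as plain
per-token counts.

```python
from collections import Counter
from enum import Enum


class State(Enum):
    REGISTERED_HAM = 1
    REGISTERED_SPAM = 2
    REGISTERED_UNSURE = 3
    TRAINED_HAM = 4
    TRAINED_SPAM = 5


# For each target state, the source states the proposal allows.
ALLOWED = {
    State.TRAINED_SPAM: {State.REGISTERED_HAM, State.REGISTERED_UNSURE,
                         State.TRAINED_HAM},
    State.TRAINED_HAM: {State.REGISTERED_SPAM, State.REGISTERED_UNSURE,
                        State.TRAINED_SPAM},
}


class MistakeTrainer:
    """Toy stand-in for the token database (hypothetical, not Spambayes)."""

    def __init__(self):
        self.ham = Counter()   # token -> count over messages trained as ham
        self.spam = Counter()  # token -> count over messages trained as spam
        self.states = {}       # message id -> State

    def register(self, msg_id, state):
        # Record the filter's verdict; token counts are left untouched.
        self.states[msg_id] = state

    def reclassify(self, msg_id, tokens, as_spam):
        """Apply a user correction; return True if any training happened."""
        old = self.states[msg_id]
        new = State.TRAINED_SPAM if as_spam else State.TRAINED_HAM
        if old not in ALLOWED[new]:
            return False  # not a mistake (or already trained this way)
        add, sub = (self.spam, self.ham) if as_spam else (self.ham, self.spam)
        add.update(tokens)
        if old in (State.TRAINED_HAM, State.TRAINED_SPAM):
            # 4 -> 5 or 5 -> 4: undo the earlier, wrong training.
            sub.subtract(tokens)
        self.states[msg_id] = new
        return True
```

Usage would be something like register("m1", State.REGISTERED_HAM) when
the filter delivers the message, then reclassify("m1", tokens,
as_spam=True) once it turns up in the SPAM training folder.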
The last two transitions I would not expect to occur all that often,
but people do make mistakes ;-) Since people really do appear to be of
the opinion that it is better to have a balanced message count than an
unbalanced one, maybe we could also automatically train on the last "x"
HAM/SPAM (whichever needs to be "balanced") if the ratio of one to the
other exceeds 1.5.
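
The balancing check could look roughly like the sketch below. The
function name balance_plan is made up for illustration, and since I have
not pinned down "x", the sketch assumes it is the smallest number of
extra messages that brings the ratio back to 1.5; any other choice of
"x" would work just as well with the same check.

```python
import math


def balance_plan(n_ham, n_spam, max_ratio=1.5):
    """Return (label, count): which class to top up, and by how many.

    Returns (None, 0) when the larger count is at most max_ratio times
    the smaller one. The count is an assumed definition of "x": the
    minimum needed to get back within max_ratio.
    """
    big, small = max(n_ham, n_spam), min(n_ham, n_spam)
    if small * max_ratio >= big:
        return (None, 0)
    label = "ham" if n_ham < n_spam else "spam"
    needed = math.ceil(big / max_ratio) - small
    return (label, needed)
```

For example, with 300 trained ham and 100 trained spam this asks for 100
more spam, which gives 300 to 200, exactly the 1.5 limit.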
So, what do you think?
Good idea? Or am I just smoking the good shit? :-)