[Spambayes] Leaving for another tool. [BUG + FIX]
thruska at cubiclesoft.com
Tue Dec 11 15:51:11 CET 2007
Jesse Pelton wrote:
> Sounds like there's a research paper in here somewhere, should anyone
> want one. I suspect that I'm one of a large majority of contented
> SpamBayes users, but that's hard to know. I'm quite happy with the
> filtering obtained by training SpamBayes using what seems to me an easy
> and intuitive approach: if SpamBayes puts a message in the wrong place,
> I move it where it belongs.
> But there are some number of people who get results that leave them
> frustrated and annoyed. It would be interesting to know why that is.
> Different data? Different training methods? Different expectations?
> Something else altogether?
> Maybe it's an indictment of the open source process that no one has
> answered these questions. There's no one to commission the research,
> just a few good souls who had a need, saw a possible solution, scratched
> their own itch, and kindly made the resulting software available for the
> rest of us. On the other hand, maybe it's just a matter of time, and
> Pete or Thomas or someone else will have the interest and resources to
> puzzle out an explanation for the wide range of experiences.
Based on what I've read, I decided to hose my Spambayes training
database last night and start over using a new approach. In the POP3
proxy's "Advanced" configuration, I set the defaults on the Review
Messages page so that Unsure messages default to Spam, and Ham and Spam
messages both default to Discard. From what I've read, this is the
"correct" approach.
However, also based on what I've read, Spambayes needs to keep its
training database small to remain effective. Therefore, training on
more than one pre-classified message at a time is also wrong. Here's why:
Spambayes pre-classifies a message as unsure, ham, or spam. That
classification sticks. When the message is used to train Spambayes,
Spambayes uses the original message and the user-based classification to
train. So far so good. HOWEVER, that single training step changes the
database, so the classifications already assigned to every other
message are now stale. Training on those other messages will most
likely dilute the database UNLESS they are first reclassified and
filtered against the updated database. This is a serious bug from user,
developer, AND statistical perspectives. This bug's most likely source
is the developers who retrain on a small set of messages daily that
they've carefully whittled down by hand. Users don't do this.
To fix this, Spambayes needs to reclassify all messages selected by the
user and pick and choose which ones it actually needs to train on.
Here's the ruleset that should be used (PHP-like pseudocode based on
observations of the behavior of the POP3 proxy - sorry, I don't know
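$userclassification = $_POST["userclassify_" . $id];
if ($userclassification == "Ham" || $userclassification == "Spam") {
    $reclassified = ClassifyTheOriginalMessage($id);
    // Train only when the fresh classification disagrees with the user.
    if ($reclassified != $userclassification) {
        TrainOnMessage($id, $userclassification);  // hypothetical helper name
    }
}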
That simple change (probably a 5 minute fix for the developers) will
keep the database really small no matter how users behave.
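
For anyone who wants to play with the idea outside the proxy, here is a
minimal Python sketch of the same rule. Everything in it is an
assumption made for illustration: ToyClassifier is a crude word-count
scorer standing in for the real classifier, and train_on_error is a
hypothetical name. None of this is the SpamBayes API.

```python
from collections import Counter


class ToyClassifier:
    """Toy word-count scorer; a stand-in, not the SpamBayes classifier."""

    def __init__(self):
        self.counts = {"Ham": Counter(), "Spam": Counter()}
        self.trained = 0  # number of messages actually trained on

    def classify(self, text):
        """Return "Ham", "Spam", or "Unsure" from raw token counts."""
        words = set(text.lower().split())
        ham = sum(self.counts["Ham"][w] for w in words)
        spam = sum(self.counts["Spam"][w] for w in words)
        if ham == spam:
            return "Unsure"
        return "Ham" if ham > spam else "Spam"

    def train(self, text, label):
        """Add the message's distinct words to the counts for `label`."""
        self.counts[label].update(set(text.lower().split()))
        self.trained += 1


def train_on_error(classifier, reviewed):
    """Train only where a fresh classification disagrees with the user."""
    for text, user_label in reviewed:
        if user_label not in ("Ham", "Spam"):
            continue  # "Discard" and similar choices never train
        # Reclassify against the database as it stands right now.
        if classifier.classify(text) != user_label:
            classifier.train(text, user_label)


reviewed = [
    ("cheap pills online", "Spam"),
    ("meeting notes attached", "Ham"),
    ("cheap pills today", "Spam"),
    ("notes from the meeting", "Ham"),
    ("lunch tomorrow", "Discard"),
]
clf = ToyClassifier()
train_on_error(clf, reviewed)
print(clf.trained, "of", len(reviewed), "messages trained")
```

In this toy run only the first two messages get trained. The third and
fourth are already classified correctly once the first two are in the
database, so they are skipped, which is what keeps the database small.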
On the plus side, I am noticing a significant difference this time
around. I have trained on just 20 messages so far, and it is definitely
working a lot better than my previous approach of training on everything
(60,000+ messages, and it took almost 300 messages to reach the point
I'm at now). I still have a ways to go before I know for certain.
Training one message at a time is going to take a while.