[Spambayes] Leaving for another tool. [BUG + FIX]

Tue Dec 11 15:51:11 CET 2007

Jesse Pelton wrote:
> Sounds like there's a research paper in here somewhere, should anyone
> want one. I suspect that I'm one of a large majority of contented
> SpamBayes users, but that's hard to know. I'm quite happy with the
> filtering obtained by training SpamBayes using what seems to me an easy
> and intuitive approach: if SpamBayes puts a message in the wrong place,
> I move it where it belongs.
>  
> But there are some number of people who get results that leave them
> frustrated and annoyed. It would be interesting to know why that is.
> Different data? Different training methods? Different expectations?
> Something else altogether?
>  
> Maybe it's an indictment of the open source process that no one has
> answered these questions. There's no one to commission the research,
> just a few good souls who had a need, saw a possible solution, scratched
> their own itch, and kindly made the resulting software available for the
> rest of us. On the other hand, maybe it's just a matter of time, and
> Pete or Thomas or someone else will have the interest and resources to
> puzzle out an explanation for the wide range of experiences.

Based on what I've read, I decided to wipe my Spambayes training 
database last night and start over with a new approach.  Using the POP3 
proxy, I changed the defaults on the Review Messages page (in the 
"Advanced" configuration): Unsure messages default to Spam, Ham messages 
default to Discard, and Spam messages default to Discard.  From what 
I've gathered, this is the "correct" approach.

However, also based on what I've read, Spambayes needs to keep its 
training database small to remain effective.  Therefore, training on 
more than exactly one pre-classified message at a time is also wrong.  
Here's why:  Spambayes pre-classifies each message as unsure, ham, or 
spam when it arrives.  That classification sticks.  When the message is 
used to train Spambayes, Spambayes uses the original message and the 
user's classification to train.  So far so good.  HOWEVER, training on 
that single message changes the database, and therefore changes how 
every other message would now be classified.  The stored 
classifications of the remaining messages are stale at that point.  
This means that training on other messages will most likely dilute the 
database UNLESS they are first reclassified and the ones Spambayes now 
gets right are filtered out.  This is a serious bug from the user, 
developer, AND statistical perspectives.  This bug's most likely source 
is that the developers retrain daily on a small set of messages that 
they've carefully whittled down by hand.  Users don't do this.

To fix this, Spambayes needs to reclassify all messages selected by the 
user and pick and choose which ones it actually needs to train on. 
Here's the ruleset that should be used (PHP-like pseudocode based on 
observations of the behavior of the POP3 proxy - sorry, I don't know 
Python):

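// Only explicit Ham/Spam choices are training candidates.
$userclassification = $_POST["userclassify_" . $id];
if ($userclassification == "Ham" || $userclassification == "Spam")
{
   // Reclassify against the database as it stands right now.
   $reclassified = ClassifyTheOriginalMessage($id);
   if ($reclassified != $userclassification)
   {
      // Train only when Spambayes still gets this message wrong.
      TrainDatabaseWithMessage($id, $userclassification);
   }
}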

That simple change (probably a 5 minute fix for the developers) will 
keep the database really small no matter how users behave.
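
For anyone who prefers Python, here is a rough sketch of the same 
ruleset.  It is only an illustration: ToyClassifier is a hypothetical 
stand-in that counts tokens, not the actual Spambayes classifier or its 
API, and train_on_error is a made-up name.  The point is the control 
flow, where each message is reclassified against the current state of 
the database immediately before the decision to train.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for the real classifier; not the Spambayes API."""

    def __init__(self):
        self.counts = {"Ham": Counter(), "Spam": Counter()}
        self.trained = 0  # number of messages actually trained on

    def classify(self, text):
        # Compare how often the tokens were seen in ham vs. spam training.
        tokens = text.lower().split()
        ham = sum(self.counts["Ham"][t] for t in tokens)
        spam = sum(self.counts["Spam"][t] for t in tokens)
        if ham == spam:
            return "Unsure"
        return "Ham" if ham > spam else "Spam"

    def train(self, text, label):
        self.counts[label].update(text.lower().split())
        self.trained += 1


def train_on_error(classifier, text, user_label):
    """Train only when a fresh classification disagrees with the user."""
    if user_label not in ("Ham", "Spam"):
        return False  # "Discard" and similar choices never train
    if classifier.classify(text) == user_label:
        return False  # already classified correctly; skip training
    classifier.train(text, user_label)
    return True
```

Because the reclassification happens once per message, right before the 
training decision, a message that an earlier training step already 
"fixed" is skipped instead of being added to the database.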

On the plus side, I am noticing a significant difference this time 
around.  I have trained on just 20 messages so far, and it is definitely 
working a lot better than my previous approach of training on everything 
(60,000+ messages, and it took almost 300 messages to reach the same 
point I'm at now).  I still have a ways to go before I know for certain. 
Training one message at a time is going to take a while.

-- 
Thomas Hruska
CubicleSoft President
Ph: 517-803-4197