[Spambayes] Training on unusual ham - revisited

Sun Feb 12 04:58:22 CET 2006

> The difficulty is that there's no way to prune the database, either to
> adjust the imbalance or to simply decrease the database's size. You have
> to start again from scratch. The Spambayes establishment doesn't
> consider this to be much of an issue, since (as Seth points out)
> Spambayes does a good job of starting from scratch and building an
> acceptable scoring system after seeing surprisingly little data.

I'm not sure that I'd say that it's not considered much of an issue.   
The problem is that pruning a database is difficult.  As I understand  
it, the only safe way to do this is to remove/add entire messages,  
rather than individual tokens.  However, the SpamBayes database only  
keeps track of individual tokens and their ham/spam counts, so we  
don't have enough information to remove a set of added tokens, unless  
the original message is available.  (IIRC, Skip created an enhanced  
database that kept enough information to do this at some point; the  
code is probably around somewhere).
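
To make the constraint concrete, here is a minimal Python sketch. It is not
the actual SpamBayes storage code: the class `TokenCounts` and its `learn` and
`unlearn` methods are hypothetical names used only for illustration. It
assumes each message contributes at most one count per distinct token. The
point is that `unlearn` needs the original token list as input, because the
counts alone do not record which tokens arrived together.

```python
from collections import Counter


class TokenCounts:
    """Toy token database: only per-token ham/spam counts are stored."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.ham = Counter()
        self.spam = Counter()

    def learn(self, tokens, is_spam):
        """Add one message; each distinct token is counted once."""
        counts = self.spam if is_spam else self.ham
        counts.update(set(tokens))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        """Remove one message; requires the message's original tokens."""
        counts = self.spam if is_spam else self.ham
        distinct = set(tokens)
        # Refuse to remove tokens that were never trained with this label.
        if any(counts[tok] <= 0 for tok in distinct):
            raise ValueError("message was not trained with this label")
        for tok in distinct:
            counts[tok] -= 1
            if counts[tok] == 0:
                del counts[tok]
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


db = TokenCounts()
db.learn(["cheap", "pills"], is_spam=True)
db.learn(["pills", "now"], is_spam=True)
# Possible only because we still have the first message's tokens:
db.unlearn(["cheap", "pills"], is_spam=True)
```

After the last call the database holds exactly what training on the second
message alone would have produced. Without the first message's tokens there
is no way to know that "cheap" and one of the "pills" counts belong together.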

I'm still mostly of the opinion that using some sort of 'train to  
exhaustion' regime would work best.  This would allow both expiry and  
balancing (it essentially does pruning), and still deliver excellent  
results.  However, it would mean keeping cached mail around for a  
while, at least.  I just don't have the time at the moment (as the  
failure to get 1.1a2 out demonstrates) to implement this for the  
Outlook plug-in or sb_server (I did do a partial sb_server  
implementation some time ago, but I don't recall how far I got).
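
For anyone unfamiliar with the idea, a rough sketch of a 'train to
exhaustion' loop follows. Everything in it is an assumption made for
illustration: `ToyClassifier`, `train_to_exhaustion`, the cutoffs, and the
constants `S` and `X` are not the SpamBayes API, and the scorer combines
per-token probabilities with a plain log-odds sum rather than the chi-squared
combining that SpamBayes uses. The loop repeatedly walks the cached mail and
trains only on messages that do not yet score on the right side of their
cutoff, stopping when a full pass needs no training.

```python
import math
from collections import Counter

S, X = 0.45, 0.5  # assumed prior strength and prior probability


class ToyClassifier:
    """Illustrative stand-in for a token-based spam scorer."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.nmsgs = {True: 0, False: 0}

    def learn(self, tokens, is_spam):
        self.counts[is_spam].update(set(tokens))
        self.nmsgs[is_spam] += 1

    def token_prob(self, tok):
        ham = self.counts[False][tok]
        spam = self.counts[True][tok]
        n = ham + spam
        if n == 0:
            return X  # never-seen token: neutral
        ham_ratio = ham / (self.nmsgs[False] or 1)
        spam_ratio = spam / (self.nmsgs[True] or 1)
        p = spam_ratio / (ham_ratio + spam_ratio)
        # Pull the raw estimate towards X when there is little evidence.
        return (S * X + n * p) / (S + n)

    def score(self, tokens):
        log_odds = 0.0
        for tok in set(tokens):
            f = self.token_prob(tok)
            log_odds += math.log(f / (1.0 - f))
        log_odds = max(-30.0, min(30.0, log_odds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_to_exhaustion(clf, corpus, ham_cutoff=0.2, spam_cutoff=0.9,
                        max_rounds=10):
    """Train only on messages that still score wrongly or as unsure.

    corpus is a list of (tokens, is_spam) pairs.  Returns the number of
    training operations performed.
    """
    total = 0
    for _ in range(max_rounds):
        trained_this_round = 0
        for tokens, is_spam in corpus:
            s = clf.score(tokens)
            ok = s >= spam_cutoff if is_spam else s <= ham_cutoff
            if not ok:
                clf.learn(tokens, is_spam)
                trained_this_round += 1
        total += trained_this_round
        if trained_this_round == 0:
            break  # a full pass with nothing to fix: exhausted
    return total


corpus = [
    (["cheap", "pills", "buy"], True),
    (["meeting", "agenda", "lunch"], False),
    (["cheap", "viagra", "buy"], True),
    (["lunch", "tomorrow", "agenda"], False),
]
clf = ToyClassifier()
trained = train_to_exhaustion(clf, corpus)
```

The database is rebuilt from whatever mail is currently cached, which is why
this gives expiry for free, and only the messages that are actually needed
end up trained, which is where the pruning and balancing come from. The
`max_rounds` cap is there because nothing guarantees the loop converges on
real mail.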

> Another point (I've made it before, but I guess it bears repeating) is
> that the database imbalance is absolutely inherent in the current
> implementation of the Spambayes algorithm, at least in the Outlook
> plugin. Because users set the cutoffs to avoid false positives (you have
> to if the program is going to be useful), virtually all of Spambayes's
> mistakes are false negatives. Since mistakes are all you train on after
> the initial startup, virtually all new entries into the database are
> spam. The better job Spambayes does, the worse the imbalance becomes.

Training should be done on all unsure messages, too.  When I was  
using the Outlook plug-in, I commonly had ham end up as (low scoring)  
unsure.  That should reduce the imbalance somewhat.  Theoretically,  
once SpamBayes starts making mistakes, the number of ham-as-unsure  
would increase, thus helping the balance.

Something that I think would help is not training every false  
negative/spam-as-unsure.  Something along the lines of training one,  
then rescoring the others to see if they need training.  However, the  
plug-in does not make this a simple task, at least at the moment.
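
Roughly, the 'train one, then rescore' idea could look like the sketch below.
The function `train_sparingly` and the `score`/`learn` callables are
hypothetical; nothing like this exists in the plug-in as far as this message
says. The stand-in filter at the bottom exists only to exercise the loop and
is far cruder than a real classifier.

```python
def train_sparingly(pending, score, learn, spam_cutoff=0.9):
    """Train on missed spam one message at a time, rescoring before each.

    pending: token lists the user has marked as spam.
    score:   callable returning a spam score in [0, 1] for a token list.
    learn:   callable that trains the filter on a token list as spam.
    Returns the messages that actually had to be trained.
    """
    trained = []
    for tokens in pending:
        if score(tokens) >= spam_cutoff:
            continue  # an earlier training already fixed this one
        learn(tokens)
        trained.append(tokens)
    return trained


# Crude stand-in filter: the score is the fraction of known spam tokens.
known = set()


def score(tokens):
    toks = set(tokens)
    return len(toks & known) / len(toks) if toks else 0.0


def learn(tokens):
    known.update(tokens)


pending = [
    ["cheap", "pills"],
    ["pills", "cheap"],   # same campaign, fixed by the first training
    ["win", "lottery"],
]
trained = train_sparingly(pending, score, learn)
```

Here only two of the three missed spam are added to the database, since the
second one already scores as spam once the first has been trained.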

> [...] it's a problem that has yet to be solved.

I certainly agree that this is true.  ISTM that the 'imbalance'  
problem is one that is shared by other filters, as well (cf. the  
discussion of the problem in the TREC Spam Track papers).  Anyone  
know of a good statistician with time to spare?  <0.1 wink>

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.