[Spambayes] Training on unusual ham - revisited

Tony Meyer tameyer at ihug.co.nz
Sun Feb 12 05:10:47 CET 2006

> I think the problem is more that Spambayes doesn't do anything to
> encourage sensible training schemes.

I don't agree here.  The Outlook plug-in encourages train-on-error,  
because the simplest training is clicking the 'Spam' or 'Not Spam'  
buttons for mistakes (or dragging the messages to their proper  
place).  Train-on-error (fpfnunsure) seems to be one of the best  
regimes based on the testing done so far.  (The plug-in wizard
probably encourages initial training too strongly, which should be
changed, I think.)

sb_server was recently changed to encourage train-on-error  
(fpfnunsure) as well (this will make it into 1.1a2 if I ever find  
time to do a release, or if someone else does one).  The default
action for ham and spam is 'discard', and for unsure it is 'defer',
which encourages people to train only on unsures (and presumably on
fp and fn as corrections).
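
As a rough illustration of what train-on-error (fpfnunsure) means, here
is a minimal sketch.  It is not SpamBayes code: the function names are
made up for this example, and the cutoff values are assumed, so adjust
them to whatever your setup uses.

```python
# Sketch of the fpfnunsure regime: train only on mistakes and unsures.
# HAM_CUTOFF and SPAM_CUTOFF are assumed values for illustration.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def classify(score):
    """Map a message score in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def should_train(score, is_spam):
    """Return True if the message should be trained on.

    Unsures are always trained; ham and spam verdicts are trained
    only when they disagree with the true class (fp or fn).
    """
    verdict = classify(score)
    if verdict == "unsure":
        return True
    truth = "spam" if is_spam else "ham"
    return verdict != truth
```

With this rule, correctly classified ham and spam are never trained,
which is the behaviour the 'discard' defaults are meant to encourage.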

> It wouldn't be responsible for the
> developers to force one scheme or another on the users, since there is
> no proof that any one particular scheme would work for the majority of
> users.

I think that the testing that has been done certainly indicates that  
fpfnunsure, nonedge, and tte are all superior to train-on-everything  
in almost any situation.  (My TREC tests are the main counterexample
I can think of, but they are clouded by the lack of the unsure range).

I think that the developers should set things up so that the simplest  
regime for users is the one most likely to give good results, while
allowing users to use something else if they like.  I think sb_server  
does this fairly well, since it's easy to change the default actions  
so that you get train-on-everything with the least amount of work, or  
nonedge with the least amount of work.

> For example, a lot of spam has "word salad" added as hidden text to
> confuse Bayesian filters like Spambayes.

Random 'word salad' has most often been shown to help statistical
filters like SpamBayes, not harm them.  People tend to use a fairly
small vocabulary (compared to the entire language vocabulary) in
their email (this is especially true if work and personal email are
segregated).  As such, randomly selecting a word is more likely to
result in a word outside of the user's typical email vocabulary than  
one inside.  This means it'll either not have been seen before (and  
be ignored), or have been seen in spam (particularly other 'word  
salad' spam) and actually increase the message score.
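
To make that concrete, here is a small sketch of how random words fall
out as clues.  It is a simplification, not the SpamBayes classifier:
the token table is invented, and the unknown-word probability (0.5) and
minimum clue strength (0.1) are assumed values for illustration.

```python
# Hypothetical per-token spam probabilities learned from one user's
# mail.  Everyday words lean ham; the rare words were only ever seen
# in earlier 'word salad' spam.
token_spam_prob = {
    "meeting": 0.05,
    "invoice": 0.10,
    "lunch": 0.08,
    "zephyr": 0.95,
    "quixotic": 0.92,
}

UNKNOWN_PROB = 0.5   # assumed probability for never-seen tokens
MIN_STRENGTH = 0.1   # assumed: clues closer to 0.5 than this are ignored


def significant_clues(tokens):
    """Return the (token, probability) pairs that would affect the score."""
    clues = []
    for tok in set(tokens):
        p = token_spam_prob.get(tok, UNKNOWN_PROB)
        if abs(p - 0.5) >= MIN_STRENGTH:
            clues.append((tok, p))
    return sorted(clues)


# Random dictionary words: two never seen before, two seen only in spam.
salad = ["zephyr", "obelisk", "quixotic", "marmalade"]
clues = significant_clues(salad)
```

Here the unseen words ("obelisk", "marmalade") contribute nothing, and
the remaining clues all point towards spam.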

Cleverer spam that includes less random noise (e.g. newspaper
clippings) is more of an issue.

> That's only if you define training on every unsure as using Spambayes
> correctly.  I disagree on that particular point, though the operating
> instructions don't say this.  Once Spambayes is operating well, you
> should probably not train on all the spam in the Unsure folder.

It is hard to explain this art to the average Outlook user,
however.  (Suggestions are welcome ;)

> Finally, unless
> Spambayes implements some form of pruning old messages from the
> database,

Note that if pruning is done, it's not clear that age should be the
deciding factor.  If it is, what happens to that once-a-year ham?


Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
