[Spambayes] Training on unusual ham - revisited
tameyer at ihug.co.nz
Sun Feb 12 05:10:47 CET 2006
> I think the problem is more that Spambayes doesn't do anything to
> encourage sensible training schemes.
I don't agree here. The Outlook plug-in encourages train-on-error,
because the simplest training is clicking the 'Spam' or 'Not Spam'
buttons for mistakes (or dragging the messages to their proper
place). Train-on-error (fpfnunsure) seems to be one of the best
regimes based on the testing done so far. (The plug-in wizard
probably encourages people too strongly to do initial training, which
should be changed, I think.)
sb_server was recently changed to encourage train-on-error
(fpfnunsure) as well (this will make it into 1.1a2 if I ever find
time to do a release, or if someone else does one). The default
action for ham and spam is 'discard', and for unsure it is 'defer',
encouraging people to train only on unsures (and presumably on fp and
fn as corrections).
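
To make the difference between the regimes concrete, here is a rough
sketch of which messages end up being trained under train-on-everything
and under fpfnunsure. It is not SpamBayes code: the function names, the
regime strings, and the 0.20/0.90 cutoffs are assumptions chosen for
illustration.

```python
# Assumed cutoffs for illustration; real installations may differ.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def classify(score):
    """Map a score in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def should_train(regime, score, is_spam):
    """Decide whether a message is trained under the given regime.

    regime  -- 'everything' or 'fpfnunsure' (hypothetical labels)
    score   -- the score the filter gave the message
    is_spam -- the user's verdict on the message
    """
    if regime == "everything":
        return True
    if regime == "fpfnunsure":
        truth = "spam" if is_spam else "ham"
        # A mismatch covers false positives, false negatives and unsures.
        return classify(score) != truth
    raise ValueError("unknown regime: %r" % (regime,))
```

With these definitions a ham scoring 0.01 is never trained under
fpfnunsure, while a ham scoring 0.50 (unsure) or 0.95 (a false positive)
is.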
> It wouldn't be responsible for the
> developers to force one scheme or another on the users, since there is
> no proof that any one particular scheme would work for the majority of
I think that the testing that has been done certainly indicates that
fpfnunsure, nonedge, and tte are all superior to train-on-everything
in almost any situation. (My TREC tests are the main counter-example
I can think of, but they are clouded by the lack of an unsure range).
I think that the developers should set things up so that the simplest
regime for users is one that is most likely to give good results, while
allowing users to use something else if they like. I think sb_server
does this fairly well, since it's easy to change the default actions
so that you get train-on-everything with the least amount of work, or
nonedge with the least amount of work.
> For example, a lot of spam has "word salad" added as hidden text to
> confuse Bayesian filters like Spambayes.
Random 'word salad' has most often been shown to help statistical
filters like SpamBayes, not harm them. People tend to use a fairly
small vocabulary (compared to the entire language vocabulary) in
their email (this is especially true if work and personal email are
segregated). As such, randomly selecting a word is more likely to
result in a word outside of the user's typical email vocabulary than
one inside. This means it'll either not have been seen before (and
be ignored), or have been seen in spam (particularly other 'word
salad' spam) and actually increase the message score.
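
A toy illustration of that argument follows. It is a deliberately
simplified stand-in, not the way SpamBayes combines evidence: it sums
plain naive-Bayes log-odds, and the token table, the 0.5 value for
unknown tokens, and the 0.1 strength threshold are all made-up
assumptions.

```python
import math

# Hypothetical per-token spam probabilities learned from one user's mail.
TOKEN_SPAM_PROB = {
    "meeting": 0.05,
    "invoice": 0.10,
    "viagra": 0.99,
    "zephyr": 0.90,  # only ever seen in earlier word-salad spam
}

UNKNOWN_PROB = 0.5   # a never-seen token carries no evidence
MIN_STRENGTH = 0.1   # tokens this close to 0.5 are skipped


def score(tokens):
    """Combine token probabilities with a simple log-odds sum."""
    log_odds = 0.0
    for tok in set(tokens):
        p = TOKEN_SPAM_PROB.get(tok, UNKNOWN_PROB)
        if abs(p - 0.5) < MIN_STRENGTH:
            continue  # unknown or uninformative token: ignored
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))


base = score(["viagra"])
unseen_salad = score(["viagra", "quixotic", "obelisk"])
seen_salad = score(["viagra", "zephyr"])
```

Here unseen_salad equals base, because the random words have never been
seen and are skipped, while seen_salad is higher than base, because
"zephyr" has previously turned up only in spam.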
Cleverer spam that includes less random noise (e.g. newspaper
clippings) is more of an issue.
> That's only if you define training on every unsure as using Spambayes
> correctly. I disagree on that particular point, though the operating
> instructions don't say this. Once Spambayes is operating well, you
> should probably not train on all the spam in the Unsure folder.
It is hard to try to explain this art to the average Outlook user,
however. (Suggestions are welcome ;)
> Finally, unless
> Spambayes implements some form of pruning old messages from the
Note that if pruning is done, it's not clear that age should be the
deciding factor. If it is, what happens to that once-a-year ham?
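
A small sketch of the once-a-year problem, assuming a hypothetical list
of training records and a made-up 120-day limit (none of this is taken
from SpamBayes itself):

```python
from datetime import date, timedelta


def prune_by_age(trained, today, max_age_days):
    """Keep only the training records newer than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [rec for rec in trained if rec["date"] >= cutoff]


# Hypothetical training records: one regular ham, one yearly ham.
trained = [
    {"subject": "Weekly status", "kind": "ham",
     "date": date(2006, 2, 6)},
    {"subject": "Annual insurance renewal", "kind": "ham",
     "date": date(2005, 2, 20)},
]

kept = prune_by_age(trained, today=date(2006, 2, 12), max_age_days=120)
```

After pruning, kept holds only the weekly message. The yearly renewal
notice has been forgotten just before the next one is due to arrive.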
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.