[Spambayes] Telling SpamBayes a certain sender is OK

Wed Nov 22 00:54:32 CET 2006

Seidman, David (OCTO) wrote on Tuesday, November 21, 2006 1:07 PM -0500:

> I get messages from Metro Alert on request, which I often delete
> after reading because there can be several of them when a subway line
> is experiencing trouble.  I want to see them, but the SpamBayes
> Outlook plug-in is convinced they are spam, even though I have never
> moved them into the Spam folder.  How do I tell it that any message
> from Metro Alert should be left in the Inbox?  I am using the current
> version of SpamBayes, and I have attached the clues file for one of
> the Metro Alert messages that was rated at 99%.

The spam clues you sent are very helpful.  First of all, they show that
you've trained on lots of ham, but very little spam.

   # ham trained on: 6579
   # spam trained on: 188

Though we don't know exactly why, SpamBayes seems to have difficulty
when the imbalance is this severe.  SpamBayes has a very good idea of
what you consider ham, but only a mild notion of what you consider spam.
Look at the 'spamprob' values in the list of spam clues to see how the
individual words (tokens) score.  Spamprob=0.5 for a token means it is
equally likely for a message containing that token to be spam or ham.
Most of the words in the message were in this middle range, with no
strong ham clues and only a few strong spam clues.  In other words, this
message is not typical for either ham or spam that you've trained.
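
To make the 'spamprob' column more concrete, here is a rough sketch of
a Robinson-style per-token estimate.  It is an illustration, not the
plug-in's actual code: the function name, the argument names and the
defaults (0.5 for an unseen token, a prior strength of 0.45) are my
assumptions.  Note that the token counts are divided by the number of
ham and spam messages trained, so the training totals feed into every
token's score.

```python
def token_spamprob(spam_count, ham_count, nspam, nham,
                   unknown_prob=0.5, unknown_strength=0.45):
    """Estimate how strongly one token points toward spam (0.0 to 1.0).

    spam_count / ham_count: trained spam / ham messages containing the token.
    nspam / nham: total trained spam / ham messages.
    The two defaults are assumed values, not read from any configuration.
    """
    # Normalise by corpus size so that a large ham corpus does not
    # automatically make every token look hammy.
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    n = spam_count + ham_count
    if n == 0 or spam_ratio + ham_ratio == 0:
        # Never seen in training: no evidence either way.
        return unknown_prob
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward unknown_prob when the token is rare.
    return (unknown_strength * unknown_prob + n * raw) / (unknown_strength + n)


# A token never seen in training is neutral.
neutral = token_spamprob(0, 0, 188, 6579)
# A token seen in the same fraction of ham and spam is neutral too.
balanced = token_spamprob(5, 5, 100, 100)
```

Both calls give 0.5 in this sketch, which is the "equally likely to be
spam or ham" case described above.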

If you have more spam available, you could train SpamBayes on that.
It's probably easier to retrain from scratch, and if you do it in small
batches, you should get better results.  While retraining, try to
maintain a roughly comparable number of ham and spam in your training
set.  You can do this by training on perhaps a dozen or two messages at
a time, half ham and half spam.  After training on each group of
messages, filter all your messages and select the ones whose scores are
furthest from correct to train on in the next round.  When I do this, I start
getting very good results at around 100 each of trained ham and spam.
You don't need to train on thousands of messages, and it doesn't
necessarily work better that way.
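
The selection step in that procedure can be sketched roughly as
follows.  This is only an outline of the idea, not something the
plug-in provides: pick_next_batch and the (msg_id, is_spam, score)
tuples are hypothetical, with is_spam being your own judgement of the
message and score being the filter's spam score between 0 and 1.

```python
def pick_next_batch(scored, per_class=10):
    """Choose the worst-scored ham and spam, in equal numbers.

    scored: list of (msg_id, is_spam, score) tuples, where is_spam is the
    true label and score is the filter's spam score in [0, 1].
    Returns (ham_to_train, spam_to_train).
    """
    ham = [m for m in scored if not m[1]]
    spam = [m for m in scored if m[1]]
    # Ham is most wrong when its spam score is highest.
    worst_ham = sorted(ham, key=lambda m: m[2], reverse=True)
    # Spam is most wrong when its spam score is lowest.
    worst_spam = sorted(spam, key=lambda m: m[2])
    # Take the same number from each side to keep the training set balanced.
    k = min(per_class, len(worst_ham), len(worst_spam))
    return worst_ham[:k], worst_spam[:k]


scored = [
    ("metro-alert-1", False, 0.99),  # ham scored as spam
    ("newsletter", False, 0.01),     # ham scored correctly
    ("pills-offer", True, 0.20),     # spam scored as ham
    ("lottery", True, 0.95),         # spam scored correctly
]
ham_batch, spam_batch = pick_next_batch(scored, per_class=1)
```

With per_class=1 this picks the Metro Alert ham and the low-scoring
spam, which are the two messages the filter got most wrong.  You would
train on those, re-filter, and repeat.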

--
Seth Goodman