[Spambayes] lots of unsures, heavily biased towards spam

Seth Goodman sethg at goodmanassociates.com
Sun Feb 4 23:21:30 CET 2007


David Abrahams wrote on Sunday, February 04, 2007 11:14 AM -0600:

> "Seth Goodman" <sethg at goodmanassociates.com> writes:
>
> > My preference for adding ham to a training set is to pick the
> > highest scoring ham
>
> You mean literally the ones with scores closest to 1.0?

Exactly.  This is the "worst scoring ham", and training on those
messages should allow future messages that use similar language to score
"better" as ham, i.e. a spam score closer to zero.


>
> > and train on a few at a time, rescoring the ham folder after
> > training each new group.
>
> Sorry, lots of questions:
>
>    - what does "rescoring the ham folder" mean?
>
>    - When you say "ham folder" are you referring to a folder full of
>      ham used for training?
>
>    - If so, what difference would it make to allow Spambayes to adjust
>      the scores on those messages?
>
>    - When you "pick the highest scoring ham" are you picking from your
>      general mail history or are you picking from the ham folder and
>      training those mails again?
>
> I think a glossary or terminology section would be a nice addition to
> the spambayes site :)

OK, I see where the confusion is coming from.  I use the Outlook plugin,
which operates differently from the web interface.

message corpus: all messages, whether trained or not

training set:   messages you have trained

cache:          messages in training set (sb_server only)

database:       list of all words (tokens) in trained messages,
                number of times each seen in ham versus spam

ham folder:     ham, whether trained or not

spam folder:    spam, whether trained or not

rescore:        classify messages to see the effect of
                training changes

high scoring ham:  ham with a high spam score, i.e. larger number

low scoring spam:  spam with a low spam score, i.e. smaller number
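To make the "database" entry concrete, here is a toy sketch of what
such a token database looks like.  This is only an illustration; it is
not SpamBayes' actual storage format, tokenizer, or API:

```python
# Toy token database: maps each token to counts of how many trained
# ham vs. spam messages contained it.  Sketch only, not SpamBayes code.
from collections import defaultdict

database = defaultdict(lambda: [0, 0])   # token -> [ham_count, spam_count]

def train(tokens, is_spam):
    """Record one trained message's tokens in the database."""
    for tok in set(tokens):              # each token counted once per message
        database[tok][1 if is_spam else 0] += 1

train("cheap meds online now".split(), is_spam=True)
train("meeting agenda attached".split(), is_spam=False)
```

The classifier then scores a new message by combining the ham/spam
evidence of the tokens it contains.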


Here's some discussion on how these terms relate to training.  A simple
way to train Spambayes initially is to segregate a group of messages
into ham and spam folders, then train on all of it.  For ongoing
training, the simplest approach is to train on all errors and unsures.
This works well enough for many users, and it is what makes Spambayes
practical.

You can do somewhat better with more effort.  Training on a message
affects not only the classification of that message, but other messages
with similar language.  You can do just as well, or sometimes better, by
training on a subset of the available messages.  The catch is
identifying the right ones.

There has been a lot of discussion about "minimalist" training sets and
various methods to achieve them, but they all have one thing in common.
They train on a small number of messages, run the classifier on all
available messages, and then decide which messages to train on next.
How you pick the messages to train, how many you train at a time, and
when you stop defines the training method.

Skip described his "train to exhaustion" method in a separate message.
This is the most recent, and perhaps the best, in a long line of
minimalist training schemes.  It is an iterative procedure where you
train on the worst scoring ham and spam, in groups with a fixed ham/spam
ratio, then rescore all messages and iterate.  Some messages that were
similar to the ones you just trained now score better, and you don't
have to train on them.  You continue until all messages score correctly
or you die from exhaustion :)  You need to run that as a script, since
it requires training on some messages more than once, which you can't do
from the normal user interface.
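Schematically, such a script looks like the following.  This is my
sketch of the idea, not Skip's actual script; score() and train() are
hypothetical stand-ins for the classifier, and the 1:2 ham/spam group
ratio is just an example:

```python
# Sketch of "train to exhaustion" (hypothetical names, not Skip's script).
# score(msg) returns a spam probability in [0, 1]; train() may see the
# same message again on later passes.
def train_to_exhaustion(ham, spam, score, train, group=2, max_passes=100):
    for _ in range(max_passes):
        # Worst-scoring ham scores closest to 1.0; worst spam closest to 0.0.
        worst_ham = sorted(ham, key=score, reverse=True)[:group]
        worst_spam = sorted(spam, key=score)[:2 * group]  # example 1:2 ratio
        for m in worst_ham:
            train(m, is_spam=False)
        for m in worst_spam:
            train(m, is_spam=True)
        # Stop once every message already scores on the correct side.
        if all(score(m) < 0.5 for m in ham) and all(score(m) > 0.5 for m in spam):
            break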

The method I alluded to is called "non-edge".  You train on the worst
scoring ham and spam, also in groups with a fixed ham/spam ratio, then
rescore all messages and iterate.  You continue until all messages are
below some threshold, which is typically "tighter" than the ham/spam
thresholds for classification.  I use halfway between the ham/spam
thresholds and a perfect score (0.0 or 1.0), i.e. (ham threshold)/2,
1-((1-spam threshold)/2).  Aside from the minor difference in threshold
where you stop, the important difference from train to exhaustion is
that you never train on a message more than once, so it is possible to
do this manually, though quite grueling.  Train to exhaustion probably
performs better.  Both methods select the messages that the classifier
does the worst on for further training, so they tend to reduce unsures.
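Spelled out with SpamBayes' usual default cutoffs of 0.2 (ham) and 0.9
(spam), assumed here for the sake of the example, the non-edge stopping
thresholds work out as:

```python
# Non-edge stopping thresholds, halfway between each classification
# cutoff and a perfect score.  The 0.2/0.9 cutoffs are SpamBayes'
# usual defaults, assumed here as an example.
ham_cutoff, spam_cutoff = 0.2, 0.9

ham_stop = ham_cutoff / 2                # 0.1: train until all ham score below
spam_stop = 1 - (1 - spam_cutoff) / 2    # 0.95: and all spam score above
```

So with the defaults, you keep training until every ham scores below
0.1 and every spam scores above 0.95.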


>
> > It's deliberately indefinite, as results are variable.  I can tell
> > you that my setup has been operating at around 5% unsures, 0.5%
> > false negatives (spam in the inbox) and perhaps one false positive
> > (ham in the spam folder) per year for a long time.  This seems to
> > be typical, though 0.1% false positives might be more common.  My
> > current training set has around 250 ham and 500 spam.  What kind of
> > performance do you see?
>
> Well, I haven't been measuring carefully, unfortunately.  I just have
> a feeling that I could do better.  After balancing ham and spam last
> night I woke up to 75 messages in my SPAMBOX all correctly identified
> as spam, 20 messages (all spam) in my UNSUREBOX and 7 new messages
> in my INBOX, two of which were spam.  I have various server-side rules
> that are filing some new messages in other mailboxes but from a casual
> look it appears that none fell into those categories.  Just as for
> you, Spambayes has for years been very good about not classifying ham
> as spam.  However, it used to be that spam very rarely crept into my
> INBOX whereas recently I have been getting 2-3 false negatives every
> night.

Your mail last night was (7-2)/(75+20+7)=5% ham, while my received
messages are around 60% ham, a ten-fold difference.  Of a total of
75+20+2=97 spam you received, 20/97=21% classified as unsure (about the
same as the percentage of your total mail flow).  The fact that you had
good performance for a long time but it recently got worse suggests a
change in mail flow, which could be image spam.  Could you peruse your
next batch of spam that classifies as unsure and estimate how many of
them are image spam?

Though this doesn't explain why performance dropped after a long period
of good results, the training methods I mentioned above may reduce
unsures.  Another possibility is adjusting the ham/spam thresholds, but
that likewise doesn't explain why performance is worse than it used to
be.

Your false negative rate was 2/(75+20+7)=2%, though two messages is too
small a sample.
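For reference, the arithmetic above, spelled out from the counts you
reported:

```python
# Rates computed from the reported overnight counts.
spambox, unsure, inbox = 75, 20, 7
fn = 2                                   # spam that reached the inbox
total = spambox + unsure + inbox         # 102 messages overnight

ham_fraction = (inbox - fn) / total      # about 5% ham
total_spam = spambox + unsure + fn       # 97 spam received
unsure_rate = unsure / total_spam        # about 21% of spam unsure
fn_rate = fn / total                     # about 2% false negatives
```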


> > So yes, end user feedback is very helpful.
>
> Great.  Another problem is that I don't have a rigorous way to measure
> performance.  Any ideas?

There is a statistics page in the Spambayes manager of the Outlook
plugin that now gives most of this.  I don't know where this is located
in sb_server.


> Dreaming of a tool that can record my configuration changes, record
> training records, and learn about misclassification based on the mails
> I throw into the ham and spam training folders (looking at the
> X-Spambayes-Classification header), so we can get a clearer picture of
> what works...

The log file includes much of that.

--
Seth Goodman


