[Spambayes] Training on unusual ham - revisited

Fri Feb 10 01:23:28 CET 2006

There were a few mistakes too many in my previous post, so I am
reposting with corrections.

On Thursday, February 09, 2006 7:32 AM -0600, Bob Coe wrote:

> The difficulty is that there's no way to prune the database, either to
> adjust the imbalance or to simply decrease the database's size. You
> have to start again from scratch. The Spambayes establishment doesn't
> consider this to be much of an issue, since (as Seth points out)
> Spambayes does a good job of starting from scratch and building an
> acceptable scoring system after seeing surprisingly little data.
>
> This is all fine if you can limit your spam flow to a trickle during
> this startup period. But if you can't, things can be very unpleasant
> for a while. As part of an upgrade of my home system, I recently had
> occasion to install Spambayes from scratch on two accounts (mine and
> my wife's) that receive a LOT of spam. (My home domain name is a
> catchy one that attracts spammers and forgers like flowers attract
> bees.)

Your cup runneth over.  I feel your pain.

<OT plea>
This is an off-topic plea to the few people who have some control over
the MTA software running at their site.  Please consider using DNSBLs,
both from responsible third parties and built from local heuristics, in
addition to existing authentication mechanisms, to cut down on the
volume of spam that must be processed post-acceptance.  Accepting spam
for delivery wastes your bandwidth and consumes CPU cycles.  Rejecting
it, preferably during or right after the envelope phase of SMTP, can
really reduce your load.  An ounce of prevention ...
</OT plea>

> So while Spambayes was in its learning curve, hundreds of spam
> messages were pouring in and getting sent to our "possible spam"
> folders. And because all I had to train on was ham, anything that
> didn't go there went to our inboxes. For two or three days, until
> Spambayes got its mind right, I had to dig through this chaff and
> send it to the spam folders manually - not a fun task.

I'm not a Spambayes developer, so I am speaking only for myself here.  I
think the problem is more that Spambayes doesn't do anything to
encourage sensible training schemes.  And for a very good reason: the
jury is still out on what is the "best" training scheme, or even one
that is universally acceptable.  It wouldn't be responsible for the
developers to force one scheme or another on the users, since there is
no proof that any one particular scheme would work for the majority of
users.

That being said, there are some things you can do manually to avoid some
of this pain.  The first thing to note is that training on all unsures,
forever, is usually not the best approach.  Aside from building huge
databases, it tends to produce a trained ham/spam ratio very far from
unity, as you describe very well below.

Here's one approach that works for me and builds a smaller database with
a relatively equal number of ham and spam.

1) Initial training.

a) If you are just starting out from scratch, manually sort the spam in
your inbox into a separate spam folder.  Make sure you have roughly
equal numbers of ham and spam for training, even if this means training
on a relatively small number of messages.  It is not necessary to use a
large number of messages.  Anywhere from around ten to a few hundred of
each type is sufficient.  In addition to messages in the Inbox, most
people have a large amount of saved ham.  Resist the temptation to train
on a large folder of saved ham without the same number of spam
available.

b) When you have around 25 spam in your Spam folder, make sure you have
the same number of your most recent ham in the Inbox (or ham training
folder).  If you have more ham messages than spam, temporarily move some
of the ham to a new folder, then move it back when you're done.  If you
have fewer ham than spam, temporarily move some saved ham into the Inbox
(or ham training folder).

c) I don't recall what the default thresholds are, but I personally use
0.80 for spam and 0.05 for ham.

d) Train on the two folders with equal numbers of ham and spam.

e) In the Spambayes Manager under the Training tab, uncheck the
incremental training checkboxes so that moving a message does not
automatically train on it.

2) Until Spambayes is working well (i.e. 5% Unsures and 0.1% false
positives), try this procedure instead of simply training on all
messages in the Unsure folder.  When new messages appear in your Unsure
folder:

a) Make sure the spam score is displayed in the Inbox, Unsure and Spam
folders.

b) Hit the Spambayes button on the Outlook toolbar, select "Filter
messages" and the "Filter Now" dialog box opens.  Make sure the "Filter
the following messages" field contains your Inbox (or other ham
folders), Unsure and Spam folders.  Under "Filter action" select
"Perform all filter actions".  Under "Restrict the filter to", uncheck
both boxes.  Hit the "Close" button.

c) Select the lowest scoring spam message, if any, in the Unsure folder
and hit the "Delete as Spam" button.  Select the highest scoring ham
message, if any, in the Unsure folder and hit the "Recover from Spam"
button.  This trains on the messages as well as moving them.  For this
to have any effect, these should be messages that are not already
trained.  Don't worry, if you select a message that is already trained,
Spambayes will move it but it won't train on it again.  If you
remember that you already trained on a message, don't bother selecting
it again for training.  If you accidentally train a message in the
wrong category, it is very important that you select it and train it in
the correct category.

d) Hit the Spambayes button on the Outlook toolbar, select "Filter
messages" and the "Filter Now" dialog box opens.  Hit the "Start
Filtering" button.  When it is finished filtering, hit the "Close"
button.  Some messages may disappear from the Unsure folder and others
may move into it.

e) Glance at your Inbox and Spam folders for false positives (ham in the
spam folder) and false negatives (spam in the ham folder).  Train on any
of these using the "Delete as Spam" and "Recover from Spam" buttons.
This should not happen very often.

f) Go back to step c and repeat this process on each message in the
Unsure folder that you haven't already trained on.  Occasionally, a
message will still score as unsure even after you've trained on it, or
subsequent training may cause a message that previously classified
correctly to now classify as Unsure.  Don't worry, it will eventually
classify correctly when you train on other similar messages.

g) Once in a while, hit the Spambayes button on the Outlook toolbar and
select "Spambayes Manager".  In the "General" tab, there is a field that
tells you how many ham and spam are trained.  If the numbers are
somewhat unequal (more than 2:1 in either direction), train on some
messages of the type that has fewer in the training set.  The easiest
way to do this is to move the message into the Unsure folder, then hit
either the "Delete as Spam" or "Recover from Spam" button.  When
picking additional messages to train on, try to use the lowest scoring
spam or the highest scoring ham, i.e. the messages that came closest to
being Unsures.  Look at the "General" tab again when you're done to make
sure that the trained ham and spam numbers are more equal.  If you train
on a message that is already trained, Spambayes moves it but will not
train on it again.

h) When Spambayes is working well, go on to step 3.

3) You know that Spambayes is working well when Unsures are not more
than 5-10% of your incoming mail flow and you rarely (~0.1%) have a ham
message classify as Spam.  Since spam is much more likely to classify as
unsure than ham, the percentage of your messages that are unsures
depends on your incoming ham/spam ratio.  Once you are at this level of
performance, it is probably reasonable to train on any ham that ends up
in the Unsure folder, but generally don't train on Spam in the Unsure
folder.  That is, only train on Unsure spam that you think the filter
_really_ should have caught.
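
As a rough illustration of the rules in steps 2c, 2g and 3, here is a
small Python sketch.  It is not Spambayes code: the Message class and
the function names are made up for this example, and it assumes the
false positive rate is measured against all incoming mail.

```python
from dataclasses import dataclass


@dataclass
class Message:
    subject: str
    score: float          # spam score shown in the folder view, 0.0 to 1.0
    is_spam: bool         # your own judgment of the message
    trained: bool = False


def next_to_train(unsure):
    """Step 2c: pick the lowest scoring untrained spam and the
    highest scoring untrained ham in the Unsure folder."""
    spam = [m for m in unsure if m.is_spam and not m.trained]
    ham = [m for m in unsure if not m.is_spam and not m.trained]
    lowest_spam = min(spam, key=lambda m: m.score, default=None)
    highest_ham = max(ham, key=lambda m: m.score, default=None)
    return lowest_spam, highest_ham


def needs_rebalancing(n_ham, n_spam, max_ratio=2.0):
    """Step 2g: True if the trained counts differ by more than
    max_ratio in either direction."""
    if n_ham == 0 or n_spam == 0:
        return n_ham != n_spam
    return max(n_ham, n_spam) / min(n_ham, n_spam) > max_ratio


def working_well(n_total, n_unsure, n_false_pos,
                 max_unsure=0.10, max_fp=0.001):
    """Step 3: Unsures at most 5-10% of mail, false positives ~0.1%."""
    if n_total == 0:
        return False
    return (n_unsure / n_total <= max_unsure
            and n_false_pos / n_total <= max_fp)


unsure = [
    Message("pills", 0.55, True),
    Message("lottery", 0.70, True),
    Message("newsletter", 0.30, False),
    Message("invoice", 0.12, False),
]
spam, ham = next_to_train(unsure)
print(spam.subject, ham.subject)      # pills newsletter
print(needs_rebalancing(100, 250))    # True
```

The point of picking the messages nearest the wrong side of the Unsure
range is that each one trained should move the most other messages,
which keeps the database small.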

For example, a lot of spam has "word salad" added as hidden text to
confuse Bayesian filters like Spambayes.  These are either random words
from a dictionary or passages from news articles or books.  The net
result of including a lot of random words in a spam is to have it score
somewhere around 50%.  Spambayes already ignores any words that score
between 0.4 and 0.6, so a message's score is only the result of words
that are considered ham or spam words.  It's debatable if you want to
train on enough messages to have Spambayes correctly ignore "word salad"
words.  You can train on word salad spam, but if you do, the databases
will get larger and you will start to see more ham wind up in the Unsure
folder, thus requiring further training to correct it.
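
To make the 0.4 to 0.6 exclusion concrete, here is a minimal Python
sketch.  It only illustrates the idea of dropping near-neutral words
before scoring; the function name and the word probabilities are
invented for the example and are not taken from Spambayes.

```python
def significant_words(word_probs, low=0.4, high=0.6):
    """Keep only words whose spam probability lies outside [low, high]."""
    return {w: p for w, p in word_probs.items() if p < low or p > high}


# Hypothetical per-word spam probabilities.
spam_words = {"viagra": 0.99, "unsubscribe": 0.93, "click": 0.88}
word_salad = {"meadow": 0.50, "harbour": 0.47,
              "thursday": 0.55, "copper": 0.52}

plain = significant_words(spam_words)
padded = significant_words({**spam_words, **word_salad})
print(plain == padded)   # True: the neutral padding contributes nothing
```

If you train on a lot of word salad spam, those filler words stop being
neutral, move outside the ignored band, and start counting against ham
that happens to contain them.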

In short, once you get acceptable performance, there is nothing wrong
with just deleting spam that ends up in the Unsure folder.  No matter
how much training you do, there will always be some spam that classifies
as Unsure.  You only need to do further training if overall performance
declines.  Playing too much with the thresholds is dangerous and risks
getting false positives, which is the worst possible outcome for a spam
filter.

> Another point (I've made it before, but I guess it bears repeating) is
> that the database imbalance is absolutely inherent in the current
> implementation of the Spambayes algorithm, at least in the Outlook
> plugin. Because users set the cutoffs to avoid false positives (you
> have to if the program is going to be useful), virtually all of
> Spambayes's mistakes are false negatives. Since mistakes are all you
> train on after the initial startup, virtually all new entries into the
> database are spam.

That is inherent, as you say.  See step 2g, above, to correct this
problem, and step 3, above, to avoid it.

This low false positive rate is actually one of the more desirable
features of Spambayes.  This is the result of the two thresholds being a
different distance from 0.5 and a number of heuristics that proved to
help.  The natural result of this is that virtually all the unsures are
spam.  That is an advantage for the user, as you don't have to be as
vigilant about looking at the Unsures, and even less so with the Spam
folder, since they rarely contain ham.
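
A minimal Python sketch of the three-way split, using the thresholds
from step 1c (0.80 for spam, 0.05 for ham).  The function is
illustrative only, and how a score exactly on a threshold is treated is
an assumption here, not something taken from Spambayes.

```python
HAM_THRESHOLD = 0.05    # the values from step 1c, not the defaults
SPAM_THRESHOLD = 0.80


def classify(score, ham_threshold=HAM_THRESHOLD,
             spam_threshold=SPAM_THRESHOLD):
    """Map a spam score between 0.0 and 1.0 to a folder."""
    if score >= spam_threshold:
        return "spam"
    if score <= ham_threshold:
        return "ham"
    return "unsure"


for score in (0.01, 0.30, 0.65, 0.95):
    print(score, classify(score))
```

Everything between the two thresholds lands in Unsure, so moving
either threshold trades Unsures against mistakes on that side.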

> The better job Spambayes does, the worse the imbalance becomes.
> Note that the ham/spam ratio of incoming messages affects only the
> speed with which this effect takes hold, not the eventual outcome.
> If you use Spambayes correctly, and use it long enough, your
> database *will* achieve a highly distorted ham/spam balance.

That's only if you define training on every unsure as using Spambayes
correctly.  I disagree on that particular point, though the operating
instructions don't say this.  Once Spambayes is operating well, you
should probably not train on all the spam in the Unsure folder.

> If that degrades performance, and many believe that it does, then
> it's a problem that has yet to be solved.

I do agree with this.  I think it would be a good idea, for example, if
the initial training tab of Spambayes had an (experimental?) option for
training on an equal number of ham and spam, using the smaller of the
number of messages in the indicated ham and spam folders.  I also think
it would be a good idea if Spambayes had an experimental option to pop
up a warning if the numbers of trained ham and spam were different by
more than a user-defined ratio (which could be 1.5:1 as default) and
suggest what the user should do to correct it.  Finally, unless
Spambayes implements some form of pruning old messages from the
database, there should be something in the instructions telling users
not to keep training on all Unsures once satisfactory performance is
achieved.  Continued training of that kind makes the database grow
without bound, probably without improving performance.  A step in this direction
might be to have incremental training off by default, with a warning
text box appearing if you turn it on.  The warning would explain what
can happen if you keep training on unsure spam indefinitely, and how to
avoid that.
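
The first two suggestions could look something like the Python sketch
below.  This is only an illustration of the proposal, not an existing
Spambayes feature.  The function names are hypothetical, the message
lists are assumed to be ordered oldest to newest, and taking the most
recent messages from the larger folder is an assumption borrowed from
step 1b.

```python
def balanced_training_sets(ham, spam):
    """Trim both folders to the size of the smaller one,
    keeping the most recent messages from each."""
    n = min(len(ham), len(spam))
    return ham[len(ham) - n:], spam[len(spam) - n:]


def imbalance_warning(n_ham, n_spam, max_ratio=1.5):
    """Return a warning string if the trained counts differ by more
    than max_ratio in either direction, otherwise None."""
    if n_ham == 0 and n_spam == 0:
        return None
    low, high = sorted((n_ham, n_spam))
    if low > 0 and high / low <= max_ratio:
        return None
    short = "ham" if n_ham < n_spam else "spam"
    return (f"Trained on {n_ham} ham and {n_spam} spam; "
            f"consider training on more {short}.")


ham_set, spam_set = balanced_training_sets(list(range(10)), ["a", "b", "c"])
print(ham_set, spam_set)             # [7, 8, 9] ['a', 'b', 'c']
print(imbalance_warning(100, 400))
```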

--
Seth Goodman