[Spambayes] Any prospect of spambayes working with qmail?

Thu Feb 20 21:20:38 EST 2003

"T. Alexander Popiel" <popiel at wolfskeep.com> writes:

> Yes.  It'd also be a great source for rules for my testing harness.
> If you make the doc, I may be able to provide graphs of accuracy to
> go with it...

OK, here's what I came up with. I rethought things a bit once I
started asking "if the system is accurate enough, why train at all?"
So I've probably stressed the point that you can stop training once
the system is accurate more than I would have a day or two ago...

Also, I don't have much experience with automating training, so I may
have missed some possibilities there.

But here's what I have, for what it's worth.

----------------------------------------------------------------------

Training Methods for Spambayes
==============================

General training issues
-----------------------

In order to get good results from the spambayes system, it is
necessary to train it. As the system is trained, it gains an
"understanding" of what you, personally, consider to be ham and spam,
and bases its decisions on this understanding.

It is *not* necessary to continue training indefinitely. Once the
system is giving reliable results, it is perfectly acceptable to stop
training, except to correct the system's mistakes, or to train the
system on new categories of spam. (Or ham - if you subscribe to a new
type of newsletter, the system may initially misclassify it if it
resembles mail you previously trained on as spam. Training on the
first few issues should correct the system pretty quickly.)

While there are a number of training techniques discussed below, it
should be noted that no training method has been shown to
significantly degrade the performance of the system - results are
generally excellent with even the most minimal training.

Initial training
----------------

Before the system can start classifying mails, it needs some training. 
When the system is installed, there are basically two possibilities
for the initial training:

1. Do nothing. With no training data to go on, the system will
   initially classify everything as "unsure".

2. Train on sample collections of ham and spam. In this case,
   careful selection of the initial training set is important. The
   system can easily pick up on unintended clues. For example, if you
   train on a batch of recent spam, and on the contents of your inbox,
   the system could decide that the best spam clue is the message date
   - new mails are spam!
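
One way to catch the date problem before training is to compare the
date ranges of the two collections. The following is only a rough
sketch using the Python standard library, not part of spambayes; the
function names date_range and ranges_overlap are made up for
illustration, and it assumes each message is available as a raw
RFC 2822 string.

```python
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz


def date_range(raw_messages):
    """Return (earliest, latest) Date header values as UTC timestamps."""
    stamps = []
    for raw in raw_messages:
        header = message_from_string(raw)["Date"]
        if header is None:
            continue  # no Date header at all
        parsed = parsedate_tz(header)
        if parsed is None:
            continue  # Date header present but not parsable
        stamps.append(mktime_tz(parsed))
    if not stamps:
        raise ValueError("no parsable Date headers")
    return min(stamps), max(stamps)


def ranges_overlap(ham, spam):
    """True if the ham and spam collections cover overlapping periods."""
    ham_lo, ham_hi = date_range(ham)
    spam_lo, spam_hi = date_range(spam)
    return ham_lo <= spam_hi and spam_lo <= ham_hi
```

If ranges_overlap() returns False, the two collections come from
different periods, and the date may well end up as an unintended clue.
Overlapping ranges do not prove the set is unbiased; this only flags
the most obvious case.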

Ongoing training
----------------

Once the system is running, there are a number of possible approaches
to training. These approaches vary in the level of manual intervention
required, and potentially in the accuracy of the results (although, as
mentioned above, no training method seems to produce particularly bad
results).

1. Train on everything. Check and train on every message received,
   regardless of whether the system classified it correctly or not.
   This is a very manual chore, although it is eased by the fact that
   the system has already sorted the mail for you. It does still
   require manually scanning the spam folder.

2. Train automatically on what the system classifies as spam or ham,
   and manually on unsures. This approach tends to reinforce any
   mistakes the system makes. Retraining of false negatives (spam
   incorrectly classified as ham) and false positives (ham incorrectly
   classified as spam) helps, but converts this method into a
   variation on the "train everything" approach.

3. Train on mistakes and unsures only. Anything correctly classified
   can be left alone.

4. Train on mistakes only. If the level of unsures is low enough, it
   may not be worth training on them - particularly if it is difficult
   to decide how to classify them even by hand.

5. Don't train. This assumes that the system's decisions have reached
   an acceptable level of accuracy.

In general, as the system stabilises, any training approach (other
than automated approaches such as (2) above) is likely to tend towards
the "don't train" option.
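
To make the differences between the five approaches concrete, here is
a small sketch of them as a decision function. It is not spambayes
code: the Strategy names and the training_label function are
hypothetical, "predicted" is assumed to be one of "ham", "spam" or
"unsure", and "actual" is what a human would call the message.

```python
from enum import Enum


class Strategy(Enum):
    EVERYTHING = 1            # approach 1
    AUTO_PLUS_UNSURES = 2     # approach 2
    MISTAKES_AND_UNSURES = 3  # approach 3
    MISTAKES_ONLY = 4         # approach 4
    NONE = 5                  # approach 5


def training_label(strategy, predicted, actual):
    """Return the label to train the message under, or None to skip it."""
    if strategy is Strategy.EVERYTHING:
        return actual
    if strategy is Strategy.AUTO_PLUS_UNSURES:
        # Confident guesses are trained as guessed, right or wrong;
        # only unsures get a human decision.
        return actual if predicted == "unsure" else predicted
    if strategy is Strategy.MISTAKES_AND_UNSURES:
        # "unsure" never equals "ham" or "spam", so unsures are included.
        return actual if predicted != actual else None
    if strategy is Strategy.MISTAKES_ONLY:
        wrong = predicted != "unsure" and predicted != actual
        return actual if wrong else None
    return None  # Strategy.NONE: never train
```

Note that for a false negative (predicted "ham", actual "spam"),
AUTO_PLUS_UNSURES returns "ham", which is exactly the reinforcement of
mistakes described under (2). For a correctly classified message,
approaches 3 to 5 all return None, which is why they drift towards
"don't train" as accuracy improves.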

----------------------------------------------------------------------

Paul.

-- 
This signature intentionally left blank