[Spambayes] Effects of ham to spam ratio

Tim Peters tim.one@comcast.net
Tue, 08 Oct 2002 17:45:02 -0400


[T. Alexander Popiel]
> Executive summary: more spam is VERY good.  1:4 ham:spam is
> _much_ more accurate than 4:1 ham:spam, or even 1:1 ham:spam.
>
> I'm back with another unusual experiment.  This time, I varied
> the ratio of ham to spam, while keeping the total number of
> messages trained and tested constant.  Once again, I'm doing
> this using the all-defaults Robinson classifier.  If someone
> gives me a good set of .ini files, I'd be more than happy to
> run this test using any of the central limit algorithms, too.

They're all the same, except for which one of

use_central_limit: True
use_central_limit2: True
use_central_limit3: True

you want to use.  Other than that, the spam cutoff ratio must be 0.5, and
the only semi-automated way to extract the 4 error rates (fp/fn when
certain/uncertain) is to set nbuckets to 4 and stare at the little
histograms.
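
For concreteness, here's the kind of .ini file I mean -- a sketch, assuming
the usual [Classifier]/[TestDriver] section split from Options.py; swap in
whichever use_central_limit* line you want to test:

    [Classifier]
    use_central_limit: True

    [TestDriver]
    # The central-limit code needs the midpoint cutoff.
    spam_cutoff: 0.5
    # 4 buckets makes the certain/uncertain fp/fn counts readable
    # straight off the histograms.
    nbuckets: 4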

> I again used timcv.py as my test driver, this time with 200
> messages in each ham/spam set.

How many sets (-n10, -n5, ...)?  Looks like 5:  with 50 hams kept per set,
2 fps at 0.80% means 250 hams were tested in the 50-200 column, and 250/50
is 5 sets.
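
So presumably each cell of the table came from a run along the lines of
this (a guess at the exact spelling -- the script on your site is
authoritative):

    # One cell of the grid: 5 cross-validation sets, keeping 50 hams
    # and 200 spams from each set, 250 messages per set in all.
    python timcv.py -n 5 --ham-keep 50 --spam-keep 200 > ham50spam200.txt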

>  For the different runs, I used the --{ham,spam}-keep options to
> control how much of each set got used, with the total used always
> being 250 ham+spam from each pair.  The script I used (along with
> all the run output, etc.) is on my website at:
>
>   http://www.wolfskeep.com/~popiel/spambayes/ratio
>
> I also mangled a version of cmp.py (now called table.py,
> also on the website) to generate the following output:
>
> -> <stat> tested 50 hams & 200 spams against 200 hams & 800 spams
> [... edited for brevity ...]
> -> <stat> tested 200 hams & 50 spams against 800 hams & 200 spams
>
> ham-spam:   50-200  75-175 100-150 125-125 150-100  175-75  200-50
> fp tot:          2       1       2       2       3       3       1
> fp %:         0.80    0.27    0.40    0.32    0.40    0.34    0.10
> fn tot:         12      17      20      28      28      30      36
> fn %:         1.20    1.94    2.67    4.48    5.60    8.00   14.40
> h mean:      28.80   25.01   22.57   20.83   19.80   18.74   16.59
> h sdev:       8.37    7.61    7.09    7.07    7.24    7.24    7.30
> s mean:      78.32   76.48   75.05   73.79   72.88   70.96   68.10
> s sdev:       7.87    8.36    8.82    9.28    9.77   10.36   10.86
> mean diff:   49.52   51.47   52.48   52.96   53.08   52.22   51.51
> k:            3.05    3.22    3.30    3.24    3.12    2.97    2.84
>
> There are several interesting things here:
>
> 1. The false positive rate remains insignificant throughout.
> 2. The false negative rate drops significantly as the ham:spam
>    ratio goes down.  The more spam you have in your mailfeed,
>    the better this whole thing works.

The reason isn't clear, though:  it may well have less to do with the ratio
than with the absolute quantity of spam trained on.  If there's sufficient
variety in your spam, it could simply be that 200 is way too few to get a
representative sampling of the diversity your spam, umm, enjoys <wink>.

> 3. The ham:spam ratio affects the spam sdev much more than the
>    ham sdev.

Which is more reason to be suspicious:  sdev is a measure of how wild the
data is.  If the sdev gets steady as the absolute count increases, it means
the data is "settling down".  Your spam sdev goes up by about 0.50 in each
column, with no sign of settling down "to the left", which suggests that
even at the 50-200 extreme it's *still* finding plenty of new stuff in the
spam.
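
You can see it straight from your spam sdev row (Python, numbers lifted
from the table):

    # Successive differences of the spam sdev row, left to right.
    sdevs = [7.87, 8.36, 8.82, 9.28, 9.77, 10.36, 10.86]
    print([round(b - a, 2) for a, b in zip(sdevs, sdevs[1:])])
    # -> [0.49, 0.46, 0.46, 0.49, 0.59, 0.5] -- no sign of flattening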

Do you have a lot of Asian spam?  The gimmicks we've got for that ("skip"
and "8bit%" meta-tokens) learn slowly, and that "skip" learns at all here is
just a lucky accident.
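
If you're curious, here's a rough sketch of the *idea* behind those
meta-tokens -- illustrative only, the real rules live in tokenizer.py and
differ in detail:

    # Illustrative sketch; not the actual tokenizer.py code.
    def meta_tokens(word, max_len=12):
        # Overly long "words" (common in Asian charsets, which don't
        # break on whitespace the way English does) aren't used
        # directly; a "skip" token recording the first character and
        # a rough length bucket stands in for them.
        if len(word) > max_len:
            yield "skip:%c %d" % (word[0], len(word) // 10 * 10)
        # Words containing high-bit characters also get a bucketed
        # "8bit%" token summarizing how much of the word is 8-bit.
        n = sum(1 for ch in word if ord(ch) > 127)
        if n:
            yield "8bit%%:%d" % (100 * n // len(word))

The point is that all the detail in such messages collapses into a few
coarse buckets like these.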

> 4. Tim's k value (mean separation divided by sum of standard
>    deviations) is best with slightly less ham than spam (at 2:3),
>    which happens to be about the same ratio as in my real mailfeed.
>
> It would be very interesting to find out if the best ham:spam
> ratio for k (#4 above) is constant, or if it's actually tied to
> the ratio in the real mail feed from which the training data is
> taken.  This may be hard to measure for people who are using
> corpora augmented from several sources.

It would be better <wink> to get independent results from the same kind of
test but run with more data.  I know that, for example, in my data, I have
to train on several thousand spam before the improvement in spam
identification slows to a crawl.
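
(For anyone checking the arithmetic, the k Alex defined is easy to
recompute from the table; here's the best column, 100-150:)

    # k = separation of the means over the sum of the sdevs,
    # numbers from the 100-150 column of the table.
    h_mean, h_sdev = 22.57, 7.09
    s_mean, s_sdev = 75.05, 8.82
    print(round((s_mean - h_mean) / (h_sdev + s_sdev), 2))  # -> 3.3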

Thanks for the report, Alex!  Well done and provocative.