[Spambayes] Sequential Test Results

Jim Bublitz jbublitz@nwinternet.com
Sat, 05 Oct 2002 12:32:59 -0700 (PDT)


I have a very unusual corpus of ham and spam compared
to "normal", so these results may not be widely
applicable.

In evaluating Graham and Spambayes I've used both
random testing (not as extensive as Spambayes) and
sequential testing (train on first N, test next M).
Since I use qmail, all of my mail is individual files
and the filenames are the delivery timestamp, so it's
easy to get an accurate sequence. In my experience,
testing sequentially (as above) has always been "worst
case" performance.

A few days ago I switched to simulating actual
performance, since I have to implement something sooner
or later. My test procedure is:

1. Train on first T msgs
2. Test next t msgs
3. Train (incrementally) on t msgs
4. Loop on 2 & 3 for N msgs

(all runs are 50/50 spam/ham, which matches my average
intake of about 200 msgs/day)
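The procedure above can be sketched in Python. This is a hypothetical stand-in, not Spambayes code: the whitespace tokenizer and the frequency-based scorer are toy placeholders, and run_sequence is a made-up name.

```python
def run_sequence(messages, labels, T=8000, t=200, cutoff=0.5):
    """Train on the first T messages, then alternate: score the next t,
    record errors, and fold those t into the training data (steps 1-4)."""
    model = {}  # token -> [ham_count, spam_count]; stand-in for a real classifier

    def train(msgs, labs):
        for msg, is_spam in zip(msgs, labs):
            for tok in msg.split():  # toy tokenizer: split on whitespace
                counts = model.setdefault(tok, [0, 0])
                counts[1 if is_spam else 0] += 1

    def score(msg):
        # Placeholder scorer: fraction of tokens seen more often in spam.
        toks = msg.split()
        spammy = sum(1 for tok in toks
                     if model.get(tok, [0, 0])[1] > model.get(tok, [0, 0])[0])
        return spammy / len(toks) if toks else 0.5

    train(messages[:T], labels[:T])          # step 1
    fn = fp = 0
    for i in range(T, len(messages), t):     # step 4: loop over chunks of t
        chunk, labs = messages[i:i + t], labels[i:i + t]
        for msg, is_spam in zip(chunk, labs):  # step 2: test next t msgs
            predicted_spam = score(msg) >= cutoff
            if is_spam and not predicted_spam:
                fn += 1
            elif not is_spam and predicted_spam:
                fp += 1
        train(chunk, labs)                   # step 3: incremental training
    return fn, fp
```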

For T = 8000, t = 200, N = 14400, the results I got for
Graham were (cutoff is independent of anything else, so
select the most desirable result):

                     (zero above)
cutoff: 0.15 -- fn =     0 (0.00%)  fp =     7 (0.33%)
cutoff: 0.16 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.17 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.18 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.19 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.20 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.21 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.22 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.23 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.24 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.25 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.26 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.27 -- fn =     0 (0.00%)  fp =     6 (0.29%)
cutoff: 0.28 -- fn =     0 (0.00%)  fp =     4 (0.19%)
cutoff: 0.29 -- fn =     0 (0.00%)  fp =     1 (0.05%)
cutoff: 0.30 -- fn =     0 (0.00%)  fp =     1 (0.05%)
cutoff: 0.31 -- fn =     0 (0.00%)  fp =     1 (0.05%)
cutoff: 0.32 -- fn =     0 (0.00%)  fp =     1 (0.05%)
cutoff: 0.33 -- fn =     0 (0.00%)  fp =     1 (0.05%)
cutoff: 0.34 -- fn =     0 (0.00%)  fp =     1 (0.05%)
------------------------------------------------------
cutoff: 0.35 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.36 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.37 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.38 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.39 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.40 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.41 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.42 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.43 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.44 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.45 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.46 -- fn =     0 (0.00%)  fp =     0 (0.00%)
cutoff: 0.47 -- fn =     0 (0.00%)  fp =     0 (0.00%)
------------------------------------------------------
cutoff: 0.48 -- fn =     1 (0.05%)  fp =     0 (0.00%)
cutoff: 0.49 -- fn =     1 (0.05%)  fp =     0 (0.00%)
cutoff: 0.50 -- fn =     2 (0.10%)  fp =     0 (0.00%)
cutoff: 0.51 -- fn =     2 (0.10%)  fp =     0 (0.00%)
cutoff: 0.52 -- fn =     2 (0.10%)  fp =     0 (0.00%)
cutoff: 0.53 -- fn =     2 (0.10%)  fp =     0 (0.00%)
cutoff: 0.54 -- fn =     2 (0.10%)  fp =     0 (0.00%)
cutoff: 0.55 -- fn =     2 (0.10%)  fp =     0 (0.00%)
cutoff: 0.56 -- fn =     2 (0.10%)  fp =     0 (0.00%)
cutoff: 0.57 -- fn =     2 (0.10%)  fp =     0 (0.00%)
cutoff: 0.58 -- fn =     2 (0.10%)  fp =     0 (0.00%)
cutoff: 0.59 -- fn =     3 (0.14%)  fp =     0 (0.00%)
cutoff: 0.60 -- fn =     3 (0.14%)  fp =     0 (0.00%)
cutoff: 0.61 -- fn =     3 (0.14%)  fp =     0 (0.00%)
cutoff: 0.62 -- fn =     3 (0.14%)  fp =     0 (0.00%)
cutoff: 0.63 -- fn =     6 (0.29%)  fp =     0 (0.00%)
cutoff: 0.64 -- fn =     6 (0.29%)  fp =     0 (0.00%)
cutoff: 0.65 -- fn =     7 (0.33%)  fp =     0 (0.00%)
cutoff: 0.66 -- fn =     7 (0.33%)  fp =     0 (0.00%)
cutoff: 0.67 -- fn =     8 (0.38%)  fp =     0 (0.00%)
cutoff: 0.68 -- fn =     9 (0.43%)  fp =     0 (0.00%)
cutoff: 0.69 -- fn =    10 (0.48%)  fp =     0 (0.00%)
cutoff: 0.70 -- fn =    10 (0.48%)  fp =     0 (0.00%)
cutoff: 0.71 -- fn =    10 (0.48%)  fp =     0 (0.00%)
cutoff: 0.72 -- fn =    10 (0.48%)  fp =     0 (0.00%)
cutoff: 0.73 -- fn =    10 (0.48%)  fp =     0 (0.00%)
cutoff: 0.74 -- fn =    15 (0.71%)  fp =     0 (0.00%)
                                       (zero below)
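A table like the one above can be produced in a single pass: score each message once, then count errors at every cutoff, since the cutoff only relabels already-computed scores. A hypothetical sketch (cutoff_sweep and format_row are made-up names):

```python
def cutoff_sweep(results, cutoffs):
    """results: list of (score, is_spam) pairs.
    Returns one (cutoff, fn, fp) row per cutoff."""
    rows = []
    for cutoff in cutoffs:
        # fn: spam scored below the cutoff; fp: ham scored at or above it.
        fn = sum(1 for score, is_spam in results if is_spam and score < cutoff)
        fp = sum(1 for score, is_spam in results if not is_spam and score >= cutoff)
        rows.append((cutoff, fn, fp))
    return rows

def format_row(cutoff, fn, fp, n_spam, n_ham):
    # Render one row in the same layout as the table above.
    return ("cutoff: %.2f -- fn = %5d (%.2f%%)  fp = %5d (%.2f%%)"
            % (cutoff, fn, 100.0 * fn / n_spam, fp, 100.0 * fp / n_ham))
```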


                          Graham
                       Spam    Ham
Mean                   0.98    0.01
Std Dev                0.04    0.02
3 sigma                0.86    0.07


For Spambayes ("out of the box" - CVS from 10/2)

cutoff: 0.41 -- fn =     0 (0.00%)  fp =   164 (2.13%)
cutoff: 0.42 -- fn =     1 (0.01%)  fp =   140 (1.82%)
cutoff: 0.43 -- fn =     1 (0.01%)  fp =   121 (1.57%)
cutoff: 0.44 -- fn =     1 (0.01%)  fp =   103 (1.34%)
cutoff: 0.45 -- fn =     1 (0.01%)  fp =    90 (1.17%)
cutoff: 0.46 -- fn =     1 (0.01%)  fp =    68 (0.88%)
cutoff: 0.47 -- fn =     2 (0.03%)  fp =    55 (0.71%)
cutoff: 0.48 -- fn =     2 (0.03%)  fp =    47 (0.61%)
cutoff: 0.49 -- fn =     2 (0.03%)  fp =    36 (0.47%)
cutoff: 0.50 -- fn =     3 (0.04%)  fp =    30 (0.39%)
cutoff: 0.51 -- fn =     5 (0.06%)  fp =    23 (0.30%)
cutoff: 0.52 -- fn =     8 (0.10%)  fp =    15 (0.19%)
cutoff: 0.53 -- fn =    11 (0.14%)  fp =    13 (0.17%)
cutoff: 0.54 -- fn =    15 (0.19%)  fp =     7 (0.09%)
cutoff: 0.55 -- fn =    18 (0.23%)  fp =     7 (0.09%)
cutoff: 0.56 -- fn =    28 (0.36%)  fp =     5 (0.06%)
cutoff: 0.57 -- fn =    36 (0.47%)  fp =     3 (0.04%)
cutoff: 0.58 -- fn =    46 (0.60%)  fp =     2 (0.03%)
cutoff: 0.59 -- fn =    55 (0.71%)  fp =     2 (0.03%)
cutoff: 0.60 -- fn =    63 (0.82%)  fp =     2 (0.03%)
cutoff: 0.61 -- fn =    73 (0.95%)  fp =     1 (0.01%)
cutoff: 0.62 -- fn =    90 (1.17%)  fp =     0 (0.00%)


                        Spambayes
                       Spam    Ham
Mean                   0.85    0.16
Std Dev                0.10    0.10
3 sigma                0.54    0.46
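The "3 sigma" rows are just mean -/+ 3*std, moved toward the middle of the score range (spam scores cluster near 1, ham near 0). A quick check of the arithmetic (three_sigma is a made-up helper; the Spambayes spam bound comes out 0.55 here rather than the 0.54 above, presumably from rounding in the displayed mean/std):

```python
def three_sigma(mean, std, spam):
    # Spam clusters near 1, so its interesting bound is mean - 3*std;
    # ham clusters near 0, so its bound is mean + 3*std.
    return mean - 3 * std if spam else mean + 3 * std

# Graham:    spam 0.98 - 3*0.04 = 0.86, ham 0.01 + 3*0.02 = 0.07
# Spambayes: ham 0.16 + 3*0.10 = 0.46
```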

For Graham, the modifications made are:

1. Word freq threshold = 1 instead of 5
2. Case-sensitive tokenizing
3. Use Gary Robinson's score calculation
4. Use token count instead of msg count in
computing probability.


Counting msgs instead of tokens in computing probability is
a fairly subtle bias (noted by Graham in "A Plan for Spam")
and is still included in Spambayes. If I count msgs instead
of tokens I can get about the same results and the mean and
std dev are unaffected, but the tails of the distributions
for ham/spam scores move closer together (no large dead
band as above). Here's why (sort of):

The probability calculation is:

(s is spam count for a token, h is ham count, H/S are either
the number of msgs seen or number of tokens seen)

prob = (s/S)/((h/H) + (s/S))

which can be refactored to:

prob = 1/(1 + (S/H)*(h/s))

or with Graham's bias:

prob = 1/(1 + (S/H)*(2*h/s))
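The refactoring holds by dividing numerator and denominator by s/S; a quick numeric check (both functions are illustrative, with Graham's 2.0 bias as an optional multiplier on the ham count; the refactored form assumes s > 0):

```python
def prob_direct(s, h, S, H, bias=1.0):
    # prob = (s/S) / ((bias*h/H) + (s/S)); bias=2.0 gives Graham's form.
    return (s / S) / ((bias * h / H) + (s / S))

def prob_refactored(s, h, S, H, bias=1.0):
    # Same value, rearranged: prob = 1 / (1 + (S/H)*(bias*h/s)).
    return 1.0 / (1.0 + (S / H) * (bias * h / s))
```

The rearranged form makes the point in the text visible: S/H multiplies the ham term directly, so halving S/H (as counting tokens does here) acts much like removing Graham's 2.0 bias.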

For my mail/testing,

msgs   -- S/H = 1
tokens -- S/H ~= 0.5 (ranges from 0.40 to 0.52 over time)

so (for my unusual data anyway) counting msgs doubles the bias on
the ham probability, but it surprisingly affects the shape of my
score distributions adversely. If I count msgs and remove Graham's
2.0 bias I get only slightly worse results than if I count tokens
and include Graham's bias, since they're almost the same calculation
and the sensitivity to S/H is small but noticeable.

Playing around with Spambayes, I get slightly better results if I
a) count tokens; b) count every token; c) drop robinson_probability_s
to .05, but I still have overlap on the score distribution tails.
(Adding Graham's bias back in helps too.) Nothing I did to Spambayes
had much effect on mean/std dev, but it did reshape the distribution
curves. I get a lot more tokens than Spambayes does, but the ratios
are close. Process sizes are comparable (about 100MB peak for the
tests above); Spambayes is about 2X faster.

For my data (which, again, is unusual) I'd conclude:

1. Counting msgs, and counting tokens only once per msg, seems
wrong to me: it enumerates containers rather than contents, or at
least mixes the two.

2. Sequential testing/training is important to look at (there
may be time related effects - certainly S/H (counting tokens)
varies over time). These are better than any other test results
I've had for either method.

3. I'd concentrate on shaping the tails of the distribution
rather than worrying about mean and std dev. Some adjustments
will degrade the mean/std dev but improve the shape of the
distribution/sharpness of discrimination. If you look at the
3 sigma limits on either method (covers about 99.7% of the
distribution in theory), the fns and fps are out past 3 sigma.
In EE terms, you want sharper rolloff, not necessarily higher
Q or a change in center frequency. Graham appears to be less
sensitive to choice of cutoff than Spambayes for my dataset.

As for training sample size, the results above were based on
an initial training sample of 8000 msgs. If I start with 200
msgs, I get an additional 15 fns in the first 200 tested, and
one additional fn through the rest of the first 8000 msgs,
at which time I'm back to the test/results above (choosing
a fairly optimum cutoff value). If I start with 1 ham and
1 spam, I get 89 fps on the first 200 and no more failures
after that. It seems to converge. Not very sensitive to
initial conditions.

All of this might only work for my email. YMMV. For either
method the results are fantastic. I'd be happy with a 90%
spam reduction and no fps.


Jim