[Spambayes] Sequential Test Results

Tim Peters tim.one@comcast.net
Sat, 05 Oct 2002 23:36:25 -0400


[Jim Bublitz]

Thanks for sharing this!  It's an excellent report.

> I have a very unusual corpus of ham and spam compared
> to "normal", so these results may not be widely
> applicable.

Could you say something about *what* makes you abnormal <wink>?

> In evaluating Graham and Spambayes I've used both
> random testing (not as extensive as Spambayes) and
> sequential testing (train on first N, test next M).
> Since I use qmail, all of my mail is individual files
> and the filenames are the delivery timestamp, so it's
> easy to get an accurate sequence. In my experience,
> testing sequentially (as above) has always been "worst
> case" performance.
>
> A few days ago I switched to simulating actual
> performance, since I have to implement something sooner
> or later.

Excellent -- nobody here has done that yet (that I know of), and I've
worried out loud that randomization allows msgs to benefit from training
msgs that appeared *after* them in time; e.g., a ham msg can be helped by a
reply to it appearing in the training ham, but that can never happen in
real life.

> My test procedure is:
>
> 1. Train on first T msgs
> 2. Test next t msgs
> 3. Train (incrementally) on t msgs
> 4. Loop on 2 & 3 for N msgs
>
> (all numbers are 50/50 spam/ham, which is about my average;
> I receive roughly 200 msgs/day)
>
> For T = 8000, t = 200, N = 14400, the results I got for
> Graham were (cutoff is independent of anything else, so
> select the most desirable result):

I'm not sure what these are results *of* -- like, the last time you ran step
#2?  An average over all times you ran step #2?
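
Just so we're talking about the same loop, here's a minimal sketch of how I
read the procedure -- msgs, train() and score() are hypothetical stand-ins,
not real spambayes calls, and the cutoff is just a placeholder:

    # Sketch only:  msgs is assumed to be a time-ordered list of
    # (text, is_spam) pairs; train() and score() are stand-in names.
    T, t, N = 8000, 200, 14400
    cutoff = 0.50                        # "called spam" iff score >= cutoff
    for text, is_spam in msgs[:T]:       # 1. train on first T msgs
        train(text, is_spam)
    fp = fn = 0
    for start in range(T, T + N, t):     # 4. loop on 2 & 3 for N msgs
        chunk = msgs[start:start + t]
        for text, is_spam in chunk:      # 2. test next t msgs
            called_spam = score(text) >= cutoff
            if called_spam and not is_spam:
                fp += 1
            elif not called_spam and is_spam:
                fn += 1
        for text, is_spam in chunk:      # 3. train (incrementally) on t msgs
            train(text, is_spam)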

>                      (zero above)
> cutoff: 0.15 -- fn =     0 (0.00%)  fp =     7 (0.33%)
> cutoff: 0.16 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.17 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.18 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.19 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.20 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.21 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.22 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.23 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.24 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.25 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.26 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.27 -- fn =     0 (0.00%)  fp =     6 (0.29%)
> cutoff: 0.28 -- fn =     0 (0.00%)  fp =     4 (0.19%)
> cutoff: 0.29 -- fn =     0 (0.00%)  fp =     1 (0.05%)
> cutoff: 0.30 -- fn =     0 (0.00%)  fp =     1 (0.05%)
> cutoff: 0.31 -- fn =     0 (0.00%)  fp =     1 (0.05%)
> cutoff: 0.32 -- fn =     0 (0.00%)  fp =     1 (0.05%)
> cutoff: 0.33 -- fn =     0 (0.00%)  fp =     1 (0.05%)
> cutoff: 0.34 -- fn =     0 (0.00%)  fp =     1 (0.05%)
> ------------------------------------------------------
> cutoff: 0.35 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.36 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.37 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.38 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.39 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.40 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.41 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.42 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.43 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.44 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.45 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.46 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> cutoff: 0.47 -- fn =     0 (0.00%)  fp =     0 (0.00%)
> ------------------------------------------------------
> cutoff: 0.48 -- fn =     1 (0.05%)  fp =     0 (0.00%)
> cutoff: 0.49 -- fn =     1 (0.05%)  fp =     0 (0.00%)
> cutoff: 0.50 -- fn =     2 (0.10%)  fp =     0 (0.00%)
> cutoff: 0.51 -- fn =     2 (0.10%)  fp =     0 (0.00%)
> cutoff: 0.52 -- fn =     2 (0.10%)  fp =     0 (0.00%)
> cutoff: 0.53 -- fn =     2 (0.10%)  fp =     0 (0.00%)
> cutoff: 0.54 -- fn =     2 (0.10%)  fp =     0 (0.00%)
> cutoff: 0.55 -- fn =     2 (0.10%)  fp =     0 (0.00%)
> cutoff: 0.56 -- fn =     2 (0.10%)  fp =     0 (0.00%)
> cutoff: 0.57 -- fn =     2 (0.10%)  fp =     0 (0.00%)
> cutoff: 0.58 -- fn =     2 (0.10%)  fp =     0 (0.00%)
> cutoff: 0.59 -- fn =     3 (0.14%)  fp =     0 (0.00%)
> cutoff: 0.60 -- fn =     3 (0.14%)  fp =     0 (0.00%)
> cutoff: 0.61 -- fn =     3 (0.14%)  fp =     0 (0.00%)
> cutoff: 0.62 -- fn =     3 (0.14%)  fp =     0 (0.00%)
> cutoff: 0.63 -- fn =     6 (0.29%)  fp =     0 (0.00%)
> cutoff: 0.64 -- fn =     6 (0.29%)  fp =     0 (0.00%)
> cutoff: 0.65 -- fn =     7 (0.33%)  fp =     0 (0.00%)
> cutoff: 0.66 -- fn =     7 (0.33%)  fp =     0 (0.00%)
> cutoff: 0.67 -- fn =     8 (0.38%)  fp =     0 (0.00%)
> cutoff: 0.68 -- fn =     9 (0.43%)  fp =     0 (0.00%)
> cutoff: 0.69 -- fn =    10 (0.48%)  fp =     0 (0.00%)
> cutoff: 0.70 -- fn =    10 (0.48%)  fp =     0 (0.00%)
> cutoff: 0.71 -- fn =    10 (0.48%)  fp =     0 (0.00%)
> cutoff: 0.72 -- fn =    10 (0.48%)  fp =     0 (0.00%)
> cutoff: 0.73 -- fn =    10 (0.48%)  fp =     0 (0.00%)
> cutoff: 0.74 -- fn =    15 (0.71%)  fp =     0 (0.00%)
>                                        (zero below)
>
>
>                           Graham
>                        Spam    Ham
> Mean                   0.98    0.01

And these are the means of what?  For example, there's no false-negative
rate as large as 0.98 in the table above, so 0.98 certainly isn't the mean
of the table entries.

> Std Dev                0.04    0.02
> 3 sigma                0.86    0.07
>
>
> For Spambayes ("out of the box" - CVS from 10/2)
>
>
> cutoff: 0.41 -- fn =     0 (0.00%)  fp =   164 (2.13%)
> cutoff: 0.42 -- fn =     1 (0.01%)  fp =   140 (1.82%)
> cutoff: 0.43 -- fn =     1 (0.01%)  fp =   121 (1.57%)
> cutoff: 0.44 -- fn =     1 (0.01%)  fp =   103 (1.34%)
> cutoff: 0.45 -- fn =     1 (0.01%)  fp =    90 (1.17%)
> cutoff: 0.46 -- fn =     1 (0.01%)  fp =    68 (0.88%)
> cutoff: 0.47 -- fn =     2 (0.03%)  fp =    55 (0.71%)
> cutoff: 0.48 -- fn =     2 (0.03%)  fp =    47 (0.61%)
> cutoff: 0.49 -- fn =     2 (0.03%)  fp =    36 (0.47%)
> cutoff: 0.50 -- fn =     3 (0.04%)  fp =    30 (0.39%)
> cutoff: 0.51 -- fn =     5 (0.06%)  fp =    23 (0.30%)
> cutoff: 0.52 -- fn =     8 (0.10%)  fp =    15 (0.19%)
> cutoff: 0.53 -- fn =    11 (0.14%)  fp =    13 (0.17%)
> cutoff: 0.54 -- fn =    15 (0.19%)  fp =     7 (0.09%)
> cutoff: 0.55 -- fn =    18 (0.23%)  fp =     7 (0.09%)
> cutoff: 0.56 -- fn =    28 (0.36%)  fp =     5 (0.06%)
> cutoff: 0.57 -- fn =    36 (0.47%)  fp =     3 (0.04%)
> cutoff: 0.58 -- fn =    46 (0.60%)  fp =     2 (0.03%)
> cutoff: 0.59 -- fn =    55 (0.71%)  fp =     2 (0.03%)
> cutoff: 0.60 -- fn =    63 (0.82%)  fp =     2 (0.03%)
> cutoff: 0.61 -- fn =    73 (0.95%)  fp =     1 (0.01%)
> cutoff: 0.62 -- fn =    90 (1.17%)  fp =     0 (0.00%)
>
>
>                         Spambayes
>                        Spam    Ham
> Mean                   0.85    0.16
> Std Dev                0.10    0.10
> 3 sigma                0.54    0.46


> For Graham, the modifications made are:
>
> 1. Word freq threshold = 1 instead of 5

That helped us a lot when we were using Graham.

> 2. Case-sensitive tokenizing

That did not (made no overall difference in error rates; it systematically
called conference announcements spam, but was better at distinguishing spam
screaming about MONEY from casual mentions of money in ham).

> 3. Use Gary Robinson's score calculation

With or without artificially clamping spamprobs into [0.01, 0.99] first (as
Graham does)?
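
I.e., the Graham-style clamping amounts to something like this (just an
illustration, not code lifted from either system):

    # Clamp a word's spamprob into [0.01, 0.99] before scoring, as Graham does.
    def clamp(prob, low=0.01, high=0.99):
        return max(low, min(high, prob))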

> 4. Use token count instead of msg count in computing probability.

We haven't tried that.

> Counting msgs instead of tokens in computing probability is
> a fairly subtle bias (noted by Graham in "A Plan for Spam")
> and is still included in Spambayes.

Not really.  We currently depart from Graham too in counting multiple
occurrences of a word only once in both training and scoring.  Our hamcounts
and spamcounts are counts of the # of messages a word appears in now, not
counts of the total number of times the word appears in msgs (as they were
under Graham).
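
In code terms the difference is roughly this (tokenize() is a stand-in name
and the dict handling is simplified -- this isn't lifted from either
codebase):

    # Graham-style training bumps a word's spam count once per occurrence:
    for w in tokenize(spam_msg):
        spamcount[w] = spamcount.get(w, 0) + 1

    # We now bump it at most once per message, matching the way scoring
    # uses a word at most once per message:
    for w in set(tokenize(spam_msg)):
        spamcount[w] = spamcount.get(w, 0) + 1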

> If I count msgs instead of tokens I can get about the same results
> and the mean and std dev are unaffected, but the tails of the
> distributions for ham/spam scores move closer together (no large
> dead band as above). Here's why (sort of):
>
> The probability calculation is:
>
> (s is spam count for a token, h is ham count, H/S are either
> the number of msgs seen or number of tokens seen)

I'm not sure what "spam count for a token" means.  For Graham, it means the
total number of times a token appears in spam, regardless of msg boundaries.
For us today, it means the number of spams in which the token appears (and
"Nigeria" appearing 100 times in a single spam adds only 1 to Nigeria's spam
count for us; it adds 100 to Graham's Nigeria spam count).  Our error rates
got lower when we made training symmetric with scoring in this respect,
although that wasn't true before we purged *all* of the deliberate biases in
Paul's scheme.

> prob = (s/S)/((h/H) + (s/S))
>
> which can be refactored to:
>
> prob = 1/(1 + (S/H)*(h/s))
>
> or with Graham's bias:
>
> prob = 1/(1 + (S/H)*(2*h/s))

Did you keep Graham's ham bias?  We have not.
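
FWIW, the refactoring above checks out; a quick sanity check with made-up
counts (nothing here comes from real data):

    # Verify that (s/S)/((h/H) + (s/S)) == 1/(1 + (S/H)*(h/s)).
    s, h = 3.0, 5.0          # made-up spam/ham counts for some token
    S, H = 4000.0, 8000.0    # made-up totals (msgs or tokens, per the above)
    p1 = (s/S) / ((h/H) + (s/S))
    p2 = 1.0 / (1.0 + (S/H) * (h/s))
    assert abs(p1 - p2) < 1e-12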

> For my mail/testing,
>
> msgs   -- S/H = 1
> tokens -- S/H ~= 0.5 (ranges from 0.40 to 0.52 over time)
>
> so (for my unusual data anyway) counting msgs doubles the bias on
> the ham probability, but surprisingly affects the shape of my
> score distributions adversely. If I count msgs and remove Graham's
> 2.0 bias I get only slightly worse results than if I count tokens
> and include Graham's bias, since they're almost the same calculation
> and the sensitivity to S/H is small but noticeable. Playing around
> with Spambayes, I get slightly better results if I a) count tokens;
> b) count every token; c) drop robinson_probability_s to .05, but I
> still have overlap on the score distribution tails. (Adding Graham's
> bias back in helps too).
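
Just to put numbers on that last point (made-up counts again -- h=5, s=3 --
not from any corpus):

    # Refactored form:  prob = 1/(1 + (S/H)*bias*h/s), where bias is
    # Graham's 2.0 ham multiplier (or 1.0 if it's dropped).
    def prob(S_over_H, bias, h=5.0, s=3.0):
        return 1.0 / (1.0 + S_over_H * bias * h / s)

    print(prob(0.5, 2.0))   # count tokens, keep the bias  -> 0.375
    print(prob(1.0, 1.0))   # count msgs, drop the bias    -> 0.375
    print(prob(1.0, 2.0))   # count msgs, keep the bias    -> ~0.23

So with S/H near 0.5 for tokens, "count msgs and drop the bias" and "count
tokens and keep the bias" really are nearly the same calculation.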

Note that overlapping tails aren't something our default scheme tries to
eliminate.  It's considered "a feature" here that Gary's scheme has a middle
ground where mistakes are very likely to live.  This is something you learn
to love <wink> after realizing that mistakes cannot be stopped.

For example, under Graham's scheme, you're *eventually* going to find ham
that scores 1.0 (and spam that scores 0.0).  With 15 discriminators, sooner
or later you're going to find a ham that just happens to have 8 .99 clues
and 7 .01 clues, and then Graham is certain it's spam.
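
To put numbers on it, Graham's combining rule is the product of the clue
probs against the product of their complements, so that hypothetical ham
scores:

    # 8 clues at .99 and 7 at .01 under Graham's combining rule.
    probs = [0.99] * 8 + [0.01] * 7
    prod = compprod = 1.0
    for p in probs:
        prod *= p
        compprod *= 1.0 - p
    print(prod / (prod + compprod))   # -> 0.99

which sails past any cutoff you'd actually use.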

There's no cutoff value that can save you from this kind of false positive,
short of never calling anything spam.  When Gary's scheme makes a mistake,
it's almost always within a short distance of the data's best spam_cutoff
value.  In a system with manual human review, this is very exploitable; in a
system without manual review, I suppose you just pass such msgs on, but it's
still *possible* to say clearly that the system is known to make mistakes in
this range.

> Nothing I did to Spambayes had much effect on mean/std dev, but did
> reshape the distribution curves.  I get a lot more tokens than
> Spambayes,

?  What does that mean?  If you're using spambayes, it's generating tokens,
so it seems hard to get a lot more than that <wink>.

> but the ratios are close. Process sizes are comparable (about 100MB
> peak for the tests above).  Spambayes is about 2X faster.
>
> For my data (which, again, is unusual) I'd conclude:
>
> 1. Counting msgs and counting tokens once per msg seems wrong to
> me. It seems to me to be enumerating containers rather than
> enumerating contents, or at least mixing the two.

Graham also *scores* tokens (at most) once per message.  The training we do
matches the way our scoring uses the information produced by training.
We've seen reason to believe that the density of a word in a msg does
contain exploitable information, but we haven't found a good way to use it;
experiment showed that distorting spamprobs with it didn't help, so for now
we just ignore it.

> 2. Sequential testing/training is important to look at (there
> may be time related effects - certainly S/H (counting tokens)
> varies over time).

No argument, and we really need to test that too.

> These are better than any other test results I've had for either
> method.
>
> 3. I'd concentrate on shaping the tails of the distribution
> rather than worrying about mean and std dev.

The so-called central-limit schemes we're investigating now are almost
entirely about separating the tails, and *knowing* when we can't, so that
should give you cause for hope.

> Some adjustments will degrade the mean/std dev but improve the
> shape of the distribution/sharpness of discrimination. If you
> look at the 3 sigma limits on either method (covers about 99.7%
> of the distribution in theory),

I've got no reason to assume the distributions here are normal, and short of
that you're reduced to Chebyshev's inequality (that at least 8/9ths of the
data lives inside 3 sigmas, regardless of distribution).  That said, they
"look pretty normal" <wink>, apart indeed from the long dribbly tails.
OTOH, some ham and some spam simply aren't clearcut, even for human
judgment, so I see no hope that this can be wholly eliminated "even in
theory".

> the fns and fps are out past 3 sigma.  In EE terms, you want sharper
> rolloff, not necessarily higher Q or a change in center frequency.
> Graham appears to be less sensitive to choice of cutoff than
> Spambayes for my dataset.

This was universally observed:  the more data it was trained on, the more
the Graham score histograms approximated two solid bars, one at 0.0 and the
other at 1.0.  Unfortunately, its *mistakes* also lived on these bars.

> As far as training sample size, the results above were based on
> an initial training sample of 8000 msgs. If I start with 200
> msgs, I get an additional 15 fns in the first 200 tested, and
> one additional fn through the rest of the first 8000 msgs,
> at which time I'm back to the test/results above (choosing
> a fairly optimum cutoff value). If I start with 1 ham and
> 1 spam, I get 89 fps on the first 200 and no more failures
> after that. It seems to converge. Not very sensitive to
> initial conditions.

That's been my experience too, and so has finding that it takes a very
large increase in
training data to cut an error rate in half.  My corpus now is at the point
where it can't really improve, due to the "some ham and spam simply aren't
clearcut" reason above.  It would take telepathy, and even people on this
list argue about whether specific msgs are ham or spam.

> All of this might only work for my email. YMMV. For either
> method the results are fantastic. I'd be happy with a 90%
> spam reduction and no fps.

You can get a 90% spam reduction, but you won't be as happy with that when
the amount of spam you get increases by a factor of 10 again.  But there's
no way you can get no fps, short of calling nothing spam.  I've noted before
that the chance my classifier would produce an FP over the next year is
smaller than the chance I'll die in that time, and I personally don't fear a
false positive more than death <wink>.