[Spambayes] Two Scheme Enter, One Scheme Leave.

Tim Peters tim.one@comcast.net
Wed, 25 Sep 2002 19:21:15 -0400


[Anthony Baxter, with lots of good experiments]
> ...
> Summarising values tried for robinson_probability_a
> (using the 0.6 cutoff)
>
>      a      fp       fn    fp+fn
>     0.0     18      650     668

Heh -- I thought 0.0 would be a disaster.

>     0.001   13       36      49
>     0.01    13       28      41
>     0.025   12       24      36
>     0.05    11       23      34
>     0.075   10       21      31
>     0.1      9       21      30
>     0.125   10       21      31
>     0.15    10       21      31
>     0.2      9       22      31
>     0.25     9       22      31
>     0.35    10       21      31
>     0.45    10       22      32
>     0.5     11       21      33     (tim's default)
>     1.0     13       29      42
>     2.0     12       42      54
>     10.0    11       96      107

I'll explain the 0.0 thing here.  The lower a is, the more the system
"believes" the probability estimates gotten just from counting how many
times each word appears in ham and spam, no matter how few (but non-zero)
times it has seen a word in the training data.  At a=0, it believes them
totally, even to the extent of giving a word a spamprob of exactly 0 (if
it's been seen in at least one ham but in no spam) or exactly 1 (the other
way around).
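
For concreteness, here's a minimal sketch of the adjustment, assuming it
has the form prob = (a*x + n*p)/(a + n), where p is the raw counting
estimate, n is how many training messages the word appeared in, and x is
the unknown-word spamprob; the names below are made up for illustration,
and this isn't the classifier source:

    def adjusted_spamprob(raw_prob, n, a, x=0.5):
        # Blend the raw counting estimate toward x; a says how hard to pull.
        return (a * x + n * raw_prob) / (a + n)

    # A word seen once, and only in spam:  raw_prob == 1.0, n == 1.
    print(adjusted_spamprob(1.0, 1, a=0.0))   # 1.0 -- the count is believed totally
    print(adjusted_spamprob(1.0, 1, a=0.5))   # 0.8333...
    print(adjusted_spamprob(1.0, 1, a=10.0))  # 0.5454... -- dragged toward x

At a=0.0 the blend vanishes, so the raw 0s and 1s survive untouched.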

Probabilities of 0 and 1 are certainties, and the math is such that if a
prob-0 word and a prob-1 word both appear in the same message, the score
comes out to exactly 0.5:  the two certainties cancel, no matter what else
is in the message.  If you look at your histograms, I bet you'll find a lot
of both ham and spam in the 0.5 bucket.  But since your cutoff is above 0.5,
they're all called ham, and that's a disaster for the f-n rate.  If your
cutoff were 0.49999999 instead, it would have been a disaster for the f-p
rate.
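
Here's why the cancellation is exact, assuming Gary's combining formula in
its P/Q/S form (geometric means of the probs and of their complements); the
sketch is illustrative, not the classifier source:

    from math import prod

    def robinson_score(probs):
        n = len(probs)
        P = 1.0 - prod(1.0 - p for p in probs) ** (1.0 / n)  # spamminess
        Q = 1.0 - prod(probs) ** (1.0 / n)                    # hamminess
        S = (P - Q) / (P + Q)
        return (1.0 + S) / 2.0

    # One prob-1 word forces P to 1, one prob-0 word forces Q to 1,
    # so S is 0 and the score is exactly 0.5, whatever else is present.
    print(robinson_score([1.0, 0.0, 0.99, 0.97, 0.95]))  # 0.5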

Cute:  with a=0.0, I had one spam with all these "certain spam" words:

prob('subject:loan') = 1
prob('subject:Find') = 1
prob('from:email addr:link2buy.com>') = 1
prob('subject:rates') = 1
prob('url:o') = 1
prob('url:e') = 1
prob('url:u') = 1
prob('originator.') = 1
prob('url:r') = 1
prob('url:training') = 1
prob('mailing.') = 1
prob('unsubscribe') = 1
prob('url:img') = 1
prob('url:esf') = 1
prob('refinance') = 1
prob('rates') = 1
prob('url:4') = 1
prob('cash') = 1
prob('loan') = 1
prob('email name:training') = 1
prob('url:link2buy') = 1
prob('homeowners') = 1
prob('header:Received:8') = 1
prob('url:101700546') = 1
prob('from:skip:e 10') = 1

OTOH, it had one "certain ham" word too:

prob('clearing') = 0

That was enough to score it exactly 0.5.

A moral of the story is that allowing certainty to creep into an inherently
uncertain scheme is certain to screw up.

In the other direction, the higher a is, the more it pushes computed
probabilities toward x (the "unknown word" spamprob).
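
Numerically, using the same assumed form as the sketch above, a word seen
in 5 training messages with a raw estimate of 0.9 drifts toward x = 0.5 as
a grows:

    # (a*x + n*raw)/(a + n) with x = 0.5, n = 5, raw = 0.9
    for a in (0.1, 0.5, 2.0, 10.0):
        print(a, (a * 0.5 + 5 * 0.9) / (a + 5))
    # 0.1 -> 0.892..., 0.5 -> 0.863..., 2.0 -> 0.785..., 10.0 -> 0.633...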

Perhaps revealing:  at a=0.0, I had 665 spam that scored 0.5, but 1235 ham
that scored 0.5.  So nearly half of all messages (more than half of my ham,
+ about 1/3 of my spam) contained at least one word that had only appeared
in the same kind of message in the training data, *and* at least one word
that had only appeared in the other kind of message.  That's a pretty
dramatic statement of just how flaky the probability estimates obtained from
counting alone can be.