[Python-Dev] Re: [Python-checkins]
python/nondist/sandbox/spambayes GBayes.py,1.7,1.8
Tim Peters
tim.one@comcast.net
Thu, 22 Aug 2002 00:49:08 -0400
[Eric S. Raymond]
> Your users' mailers would have two delete buttons -- spam and nonspam.
> On each delete the message would be shipped to bogofilter, which would
> would merge the content into its token lists.
I want to raise a caution here. Graham pulled his formulas out of thin air,
and one part of the scoring step is quite dubious. This requires detail to
understand. Where X means "word x is present" and similarly for Y, and S
means "it's spam" and "not-S" means "it's not spam", and sticking to just
the two-word case for simplicity:
P(S | X and Y) = [by Bayes]
P(X and Y | S) * P(S) / P(X and Y) = [by the usual expanded form of Bayes]
P(X and Y | S) * P(S) / (P(S)*P(X and Y | S) + P(not-S)*P(X and Y | not-S))
All that is rigorously correct so far. Now we make the simplifying
assumption that puts the "naive" in "naive Bayes", that the probability of X
is independent of the probability of Y, so that the conjoined probabilities
can be replaced by multiplication of non-conjoined probabilities. This
yields
P(X|S)*P(Y|S)*P(S)
---------------------------------------------------
P(S)*P(X|S)*P(Y|S) + P(not-S)*P(X|not-S)*P(Y|not-S)
Then, unlike a "normal" formulation of Bayesian classification, Graham's
scheme simply doesn't know anything about P(X|S) and P(Y|S) etc. It only
knows about probabilities in the other direction (P(S|X) etc). It takes 3
more applications of Bayes to get what we want from what we know. That is,
P(X|S) = [again by Bayes]
P(S|X) * P(X) / P(S)
Plug that in, mutatis mutandis, in six places, to get
P(S|X)*P(X)/P(S)*P(S|Y)*P(Y)/P(S)*P(S)
---------------------------------------------------
P(S)*P(S|X)*P(X)/P(S)*P(S|Y)*P(Y)/P(S) + ...
P(not-S)*P(not-S|X)*P(X)/P(not-S)*P(not-S|Y)*P(Y)/P(not-S)
The factor P(X)*P(Y) cancels out of numerator and denominator, leaving
P(S|X)*/P(S)*P(S|Y)*/P(S)*P(S)
---------------------------------------------------
P(S)*P(S|X)/P(S)*P(S|Y)/P(S) + ...
P(not-S)*P(not-S|X)/P(not-S)*P(not-S|Y)/P(not-S)
and simplifying some P(whatever)/P(whatever) instances away gives
P(S|X)*P(S|Y)/P(S)
---------------------------------------------------
P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)
This isn't what Graham computes, though: the P(S) and P(not-S) terms are
missing in his formulation. Given that P(not-S) = 1-P(S), and
P(not-S|whatever) = 1-P(S|whatever), what he actually computes is
P(S|X)*P(S|Y)
-------------------------------------
P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)
This is the same as the Bayesian result only if P(S) = 0.5 (in which case
all the instances of P(S) and P(not-S) cancel out). Else it's a distortion
of the naive Bayesian result.
For this reason, it's best that the number of spam msgs fed into your
database be approximately equal to the number of non-spam msgs fed into it:
that's the only way to make P(S) ~= P(not-S), so that the distortion doesn't
matter. Indeed, it may be that Graham found he had to multiply his "good
counts" by 2 in order to make up for that in real life he has twice as many
non-spam messages as spam messages in his inbox, but that the final scoring
implicitly assumes they're of equal number (and so overly favors the "it's
spam" outcome unless the math is fudged elsewhere to make up for that).
It would likely be better still to train the database with a proportion of
spam to not-spam messages reflecting what you actually get in your inbox,
and change the scoring to use the real-life P(S) and P(not-S) estimates. In
that case the "mystery bias" of 2 may actively hurt, overly favoring the
"not spam" outcome.
Note that Graham said:
Here's a sketch of how I do statistical filtering. I start with one
corpus of spam and one of nonspam mail. At the moment each one has
about 4000 messages in it.
That's consistent with all the above, although it's unclear whether Graham
intended "about the same" to be a precondition for using this formulation,
or whether fudging elsewhere was introduced empirically to make up for the
scoring formula neglecting P(S) and P(not-S) by oversight.