[Spambayes] Mozilla and summary of bayes calcs

Gary Robinson grobinson@transpose.com
Wed, 18 Sep 2002 18:03:15 -0400


Did you guys know they are looking at integrating a spam filter into
Mozilla? 

http://bugzilla.mozilla.org/show_bug.cgi?id=163188

They are looking at using the Graham approach.

To summarize my current thinking in one place, since my earlier thoughts
spanned several emails: the Graham approach implicitly does TWO Bayesian
calculations and then combines them. Each is really the same calc as the
Bayesian one based on a beta distribution prior. But the priors (expressed
by my a and b) are so overwhelmed by the actual data that there is no point
in having anything but zeroes as the beta distribution parameters.

Then those two calculations are combined.

That gives Paul's current "probabilities" for each word.
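
A minimal sketch of one way to read the two steps above, in Python. It
assumes the two calculations are the rates at which a word shows up in spam
and in ham, each being the posterior mean under a beta prior whose
parameters are zero (so it reduces to count / total), and that the
combination is the usual ratio of the spam rate to the sum of both rates.
The function and argument names are made up for illustration. Graham's
published version also doubles the good counts and clamps the result to
[.01, .99]; both are left out here.

```python
def zero_prior_rate(count, total):
    """Posterior mean under a beta prior with both parameters set to zero.

    With parameters alpha and beta the mean is
    (alpha + count) / (alpha + beta + total); zeroes leave count / total.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def graham_style_probability(spam_count, ham_count, n_spam, n_ham):
    """Combine the two zero-prior estimates into one per-word value.

    Returns None when the word was never seen, since there is no data
    to combine in that case.
    """
    spam_rate = zero_prior_rate(spam_count, n_spam)
    ham_rate = zero_prior_rate(ham_count, n_ham)
    if spam_rate + ham_rate == 0:
        return None
    return spam_rate / (spam_rate + ham_rate)


if __name__ == "__main__":
    # Hypothetical counts: word seen in 30 of 100 spams and 10 of 100 hams.
    print(graham_style_probability(30, 10, 100, 100))
```

Note that the result depends only on the two rates, not on how many times
the word was seen in total.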

But after that step, we CAN take the number of occurrences of the word into
account with yet another Bayesian calculation. (Some words may have so few
occurrences, or none at all, that we can't really say much about them.)

This calc looks much like the one from my original essay, which was based on
the beta prior, but it actually needs a different justification. I have a
pretty abstract one based on a Dirichlet prior, although there may be other
ways to justify it.

To repeat, it's

        a + (n * p(w))
f(w) = ---------------
        (a * b) + n

where n is the number of occurrences of word w and p(w) is the current
Graham-based calc.

You already account for the n=0 case by manually substituting .5, but this
accounts for low-data situations like n=1 and n=2 as well, when we don't
really have enough info to calculate a realistic probability. Obviously
we're forced to do something when n is 0 because there is no data; but in
reality we should do something when n is any low number -- 0 is just the
most extreme case. This is a smooth way of handling these low-n situations.
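
A minimal sketch of the f(w) adjustment in Python, to show how it behaves
as n grows. The values a = 1 and b = 2 are assumptions picked only so that
n = 0 gives a / (a * b) = .5, matching the manual substitution mentioned
above; they are not tuned values. The function name is made up for
illustration.

```python
def adjusted_probability(p_w, n, a=1.0, b=2.0):
    """f(w) = (a + n * p(w)) / (a * b + n).

    p_w is the Graham-based value for the word and n is its number of
    occurrences. The defaults for a and b are illustrative assumptions.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return (a + n * p_w) / (a * b + n)


if __name__ == "__main__":
    # The same raw p(w) is pulled toward 1/b when there is little data
    # and approaches p(w) itself as n grows.
    for n in (0, 1, 2, 10, 100):
        print(n, adjusted_probability(0.99, n))
```

With n = 0 the p(w) term drops out entirely, so the caller can pass any
placeholder for it and no special case is needed.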


--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454