[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8
Eric S. Raymond
esr@thyrsus.com
Sat, 24 Aug 2002 00:44:16 -0400
Tim Peters <tim.one@comcast.net>:
> P(S|X)*P(S|Y)/P(S)
> ---------------------------------------------------
> P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)
>
> This isn't what Graham computes, though: the P(S) and P(not-S) terms are
> missing in his formulation. Given that P(not-S) = 1-P(S), and
> P(not-S|whatever) = 1-P(S|whatever), what he actually computes is
>
> P(S|X)*P(S|Y)
> -------------------------------------
> P(S|X)*P(S|Y) + P(not-S|X)*P(not-S|Y)
>
> This is the same as the Bayesian result only if P(S) = 0.5 (in which case
> all the instances of P(S) and P(not-S) cancel out). Else it's a distortion
> of the naive Bayesian result.
OK. So, maybe I'm just being stupid, but this seems easy to solve.
We already *have* estimates of P(S) and P(not-S) -- we have a message
count associated with both wordlists. So why not use the running
ratios between 'em?
As long as we initialize with "good" and "bad" corpora that are approximately
the same size, this should work no worse than the equiprobability assumption.
The ratios will correct in time based on incoming traffic.
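
A rough sketch of the idea, for illustration only: estimate P(S) from the
two message counts and feed it into the full two-clue formula quoted above.
The function names and the example counts are hypothetical; this is not
code from GBayes.py or bogofilter. It assumes both corpora are non-empty,
so that 0 < P(S) < 1.

```python
def prior_from_counts(nspam, nham):
    """Estimate P(S) as the running fraction of spam messages seen so far."""
    return float(nspam) / (nspam + nham)


def combine_with_prior(p_s_x, p_s_y, p_s):
    """Two-clue naive Bayes combination that keeps the P(S), P(not-S) terms."""
    spam = p_s_x * p_s_y / p_s
    ham = (1.0 - p_s_x) * (1.0 - p_s_y) / (1.0 - p_s)
    return spam / (spam + ham)


def combine_graham(p_s_x, p_s_y):
    """Graham's version: the same formula with the prior terms dropped."""
    spam = p_s_x * p_s_y
    ham = (1.0 - p_s_x) * (1.0 - p_s_y)
    return spam / (spam + ham)


if __name__ == "__main__":
    # Hypothetical counts: 200 spam and 800 non-spam messages trained so far.
    p_s = prior_from_counts(200, 800)
    print(combine_graham(0.9, 0.8))
    print(combine_with_prior(0.9, 0.8, 0.5))   # equal-size corpora
    print(combine_with_prior(0.9, 0.8, p_s))   # prior taken from the counts
```

With equal-size corpora the prior is 0.5 and the two functions agree; they
diverge only as the running ratio drifts away from 0.5.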
Oh, and do you mind if I use your algebra as part of bogofilter's
documentation?
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>