[Python-Dev] Re: [Python-checkins]
python/nondist/sandbox/spambayes GBayes.py,1.7,1.8
Tim Peters
tim.one@comcast.net
Sat, 24 Aug 2002 01:26:12 -0400
> P(S|X)*P(S|Y)/P(S)
> ---------------------------------------------------
> P(S|X)*P(S|Y)/P(S) + P(not-S|X)*P(not-S|Y)/P(not-S)
[Eric S. Raymond]
> ...
> OK. So, maybe I'm just being stupid, but this seems easy to solve.
> We already *have* estimates of P(S) and P(not-S) -- we have a message
> count associated with both wordlists. So why not use the running
> ratios between 'em?
a. There are other fudges in the code that may rely on this fudge
to cancel out, intentionally or unintentionally. I'm loath to
type more about this instead of working on the code, because I've
already typed about it. See a later msg for a concrete example of
how the factor-of-2 "good count" bias acts in part to counter the
distortion here. Take one away, and the other(s) may well become
"a problem".
b. Unless the proportion of spam to not-spam in the training sets
is a good approximation to the real-life ratio of spam to not-
spam, it's also dubious to train the system with bogus P(S) and
P(not-S) values.
c. I'll get back to this when our testing infrastructure is trustworthy.
At the moment I'm hosed because the spam corpus I pulled off the
web turns out to be trivial to recognize in contrast to Barry's
corpus of good msgs from python.org mailing lists: every msg in the
spam corpus has stuff in its headers about the fellow who collected
the spam, while nothing in the python.org corpus does; contrarily,
every msg in the python.org corpus has python.org header info not in
the spam corpus headers. This is an easy way to get 100% precision
and 100% recall, but not particularly realistic -- the rules it's
learning are of the form "it's spam if and only if it's addressed to
bruceg"; "t's not spam if and only if the headers contain
'List-Unsubscribe'"; etc. The learning can't be faulted, but the
teacher can <wink>.
d. I only exposed the math for the two-word case above, and the
generalization to n words may not be clear from the final result
(although it's clear enough if you back off a few steps). If there
are n words, w[0] thru w[n-1]:
    prod1 <- product for i in range(n) of P(S|w[i])/P(S)
    prod2 <- product for i in range(n) of (1-P(S|w[i]))/(1-P(S))
    result <- prod1*P(S) / (prod1*P(S) + prod2*(1-P(S)))
Try that if you're better set up to experiment now; a runnable
sketch follows this list. If you do this,
the most interesting thing to see is whether results get better or
worse if you *also* get rid of the artificial "good count" boost by
the factor of 2.
> As long as we initialize with "good" and "bad" corpora that are
> approximately the same size, this should work no worse than the
> equiprobability assumption.
"not spam" is already being given an artificial boost in a couple of ways.
Given that in real life most people still get more not-spam than spam,
removing the counter-bias in the scoring math may boost the false negative
rate.
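For concreteness, the factor-of-2 business is (roughly) the
Graham-style per-word estimate -- a sketch from memory, assuming
simple per-word counts and nonzero corpus totals, not the checked-in
code:

    # Sketch of the Graham-style per-word spam probability with the
    # factor-of-2 "good count" boost; the real code differs in details.
    def spamprob(spam_count, good_count, nspam, ngood):
        g = 2 * good_count            # the artificial "not spam" boost
        b = spam_count
        if g + b < 5:
            return None               # too little evidence; word skipped
        p = (min(1.0, b / nspam) /
             (min(1.0, g / ngood) + min(1.0, b / nspam)))
        return max(0.01, min(0.99, p))   # clamp away from 0 and 1

Doubling good_count pushes each word's estimate toward "not spam",
which is part of the counter-bias that stripping the P(S)/P(not-S)
fudge would leave unbalanced.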
> The ratios will correct in time based on incoming traffic.
Depends on how training is done.
> Oh, and do you mind if I use your algebra as part of bogofilter's
> documentation?
Not at all, although if you wait until we get our version of this ironed
out, you'll almost certainly be able to pull an anally-proofread version out
of a plain-text doc file I'll feel compelled to write <wink>.