Tim Peters:
a. There are other fudges in the code that may rely on this fudge to cancel out, intentionally or unintentionally. I'm loath to type more about this instead of working on the code, because I've already typed about it. See a later msg for a concrete example of how the factor-of-2 "good count" bias acts in part to counter the distortion here. Take one away, and the other(s) may well become "a problem".
I was thinking of shooting that "goodness bias" through the head and seeing what happens, actually. I've been unhappy with that fudge in Paul's original formula from the beginning.
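For readers who haven't seen it, the "goodness bias" in question is the factor-of-2 doubling of ham counts in Paul Graham's published per-token formula from "A Plan for Spam". A minimal Python sketch of that scoring rule (the function name and parameters here are illustrative, not the project's actual code):

```python
def token_spamprob(bad_count, good_count, nbad, ngood):
    """Per-token spam probability in the style of Paul Graham's
    'A Plan for Spam' formula.

    bad_count / good_count: occurrences of the token in spam / ham.
    nbad / ngood: number of spam / ham messages trained on.
    Returns None when the token is too rare to trust.
    """
    # The fudge under discussion: ham occurrences count double,
    # deliberately tilting scores toward "not spam".
    g = 2 * good_count
    b = bad_count
    if g + b < 5:
        return None  # too little evidence for this token
    # Per-corpus frequencies, capped at 1.0 as in Graham's formula.
    p = min(1.0, b / nbad) / (min(1.0, g / ngood) + min(1.0, b / nbad))
    # Clamp to avoid certainty at either extreme.
    return max(0.01, min(0.99, p))
```

Note the effect of the doubling: a token seen equally often in spam and ham (with equal-sized corpora) scores 1/3 rather than the "neutral" 1/2, which is exactly the kind of hidden counterweight that might be compensating for distortions elsewhere.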
b. Unless the proportion of spam to not-spam in the training sets is a good approximation to the real-life ratio of spam to not-spam, it's also dubious to train the system with bogus P(S) and P(not-S) values.
Right -- which is why I want to experiment with actually *using* the real life running ratio.
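A sketch of what feeding the observed running ratio in as the prior might look like. This is hypothetical code, not the project's actual combining rule: it folds the prior odds P(S)/P(not-S) into a log-odds sum over per-token probabilities, rather than using Paul's original combining formula:

```python
import math

def spam_posterior(token_probs, n_spam_trained, n_ham_trained):
    """Combine per-token spam probabilities with an explicit prior
    taken from the running spam:ham ratio of the training data,
    instead of implicitly assuming P(S) = P(not-S) = 0.5.

    Works in log-odds so long products don't underflow.
    """
    # Prior odds from the observed training-set ratio.
    prior_odds = n_spam_trained / n_ham_trained
    log_odds = math.log(prior_odds)
    # Each token contributes its likelihood ratio p / (1 - p).
    for p in token_probs:
        log_odds += math.log(p) - math.log(1.0 - p)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)  # back to a probability
```

With a balanced prior and neutral tokens this returns 0.5; train on four times as much spam as ham and the same neutral evidence yields 0.8, which is the effect of using the real-life ratio rather than a bogus 50/50 assumption.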
c. I'll get back to this when our testing infrastructure is trustworthy. At the moment I'm hosed because the spam corpus I pulled off the web turns out to be trivial to recognize in contrast to Barry's corpus of good msgs from python.org mailing lists:
Ouch. That's a trap I'll have to watch out for in handling other people's corpora. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>