[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8

Eric S. Raymond esr@thyrsus.com
Sat, 24 Aug 2002 05:03:51 -0400


Tim Peters <tim.one@comcast.net>:
> a. There are other fudges in the code that may rely on this fudge
>    to cancel out, intentionally or unintentionally.  I'm loath to
>    type more about this instead of working on the code, because I've
>    already typed about it.  See a later msg for a concrete example of
>    how the factor-of-2 "good count" bias acts in part to counter the
>    distortion here.  Take one away, and the other(s) may well become
>    "a problem".

I was thinking of shooting that "goodness bias" through the head and seeing
what happens, actually. I've been unhappy with that fudge in Paul's original
formula from the beginning.
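
For concreteness, here is a minimal sketch of the per-word probability from
Paul Graham's "A Plan for Spam" with the factor-of-2 good-count bias pulled
out as a parameter, so the effect of removing it can be measured.  The
function and parameter names are made up for illustration and are not taken
from GBayes.py; the sketch also assumes ngood and nbad are both nonzero.

```python
def word_spam_prob(good_count, bad_count, ngood, nbad, good_bias=2.0):
    """Per-word spam probability in the style of Graham's formula.

    good_count / bad_count: times the word appeared in good / spam msgs.
    ngood / nbad: number of good / spam msgs trained on (assumed > 0).
    good_bias: multiplier on the good count (hypothetical knob);
               2.0 is Graham's doubling, 1.0 removes the bias.
    Returns None for words seen too rarely to score.
    """
    g = good_bias * good_count
    b = bad_count
    # Graham ignores words whose (biased) total count is below 5.
    if g + b < 5:
        return None
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    prob = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that no single word is ever treated as conclusive.
    return max(0.01, min(0.99, prob))
```

Note that in this sketch the bias also feeds the "seen fewer than 5 times"
cutoff, so setting good_bias=1.0 changes which rare words get scored at all,
not just their probabilities.  That is one way removing one fudge can
disturb another.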
 
> b. Unless the proportion of spam to not-spam in the training sets
>    is a good approximation to the real-life ratio of spam to not-
>    spam, it's also dubious to train the system with bogus P(S) and
>    P(not-S) values.

Right -- which is why I want to experiment with actually *using* the
real life running ratio.
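
One way such an experiment might look, as a rough sketch only: keep a running
tally of classified mail and feed the observed spam fraction into Bayes' rule
as P(S), instead of implicitly assuming 0.5.  RunningPrior, posterior_spam,
and the 0.5 fallback are all hypothetical names and choices for illustration,
not anything in the current code.

```python
class RunningPrior:
    """Tracks the observed spam / not-spam ratio of incoming mail."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def observe(self, is_spam):
        # Record one classified (or user-corrected) message.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def p_spam(self):
        # Fall back to 0.5 (an assumption) until mail has been seen.
        total = self.nspam + self.nham
        return self.nspam / total if total else 0.5


def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Bayes' rule for one word with an explicit prior P(S) = p_spam."""
    num = p_word_given_spam * p_spam
    den = num + p_word_given_ham * (1.0 - p_spam)
    # If the word was never seen in either class, return the neutral 0.5.
    return num / den if den else 0.5
```

With p_spam fixed at 0.5 this reduces to the usual ratio of the two
likelihoods; a running prior shifts every word's score toward whichever
class actually dominates the mail stream.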

> c. I'll get back to this when our testing infrastructure is trustworthy.
>    At the moment I'm hosed because the spam corpus I pulled off the
>    web turns out to be trivial to recognize in contrast to Barry's
>    corpus of good msgs from python.org mailing lists: 

Ouch.  That's a trap I'll have to watch out for in handling other
people's corpora.
-- 
		Eric S. Raymond <http://www.tuxedo.org/~esr/>