[Python-Dev] Re: [Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.7,1.8
Eric S. Raymond
esr@thyrsus.com
Sat, 24 Aug 2002 05:03:51 -0400
Tim Peters <tim.one@comcast.net>:
> a. There are other fudges in the code that may rely on this fudge
> to cancel out, intentionally or unintentionally. I'm loath to
> type more about this instead of working on the code, because I've
> already typed about it. See a later msg for a concrete example of
> how the factor-of-2 "good count" bias acts in part to counter the
> distortion here. Take one away, and the other(s) may well become
> "a problem".
I was thinking of shooting that "goodness bias" through the head and seeing
what happens, actually. I've been unhappy with that fudge in Paul's original
formula from the beginning.
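For concreteness, here is a minimal sketch of the per-word probability as described in Paul Graham's "A Plan for Spam", with the factor-of-2 good-count bias pulled out as a parameter so it can be switched off. The function name, the parameter names, and the `good_bias` knob are illustrative assumptions, not the code in GBayes.py.

```python
def word_spam_prob(good_count, bad_count, ngood, nbad,
                   good_bias=2.0, unknown_prob=0.4):
    """Per-word spam probability in the style of Graham's formula.

    good_count / bad_count: occurrences of the word in good / spam mail.
    ngood / nbad: number of good / spam messages in the training set.
    good_bias: multiplier on the good count (2.0 in Graham's essay;
               1.0 removes the bias). This knob is hypothetical.
    unknown_prob: value returned for a word never seen in training
               (0.4 in Graham's essay).
    """
    g = good_bias * good_count
    b = bad_count
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    if good_ratio + bad_ratio == 0:
        return unknown_prob
    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that no single word can be treated as certain evidence.
    return max(0.01, min(0.99, p))


# A word seen 10 times in each of two 100-message corpora:
biased = word_spam_prob(10, 10, 100, 100)                   # about 0.333
unbiased = word_spam_prob(10, 10, 100, 100, good_bias=1.0)  # 0.5
```

With the bias in place, a word that is equally common in both corpora leans toward "good"; with `good_bias=1.0` it comes out neutral.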
> b. Unless the proportion of spam to not-spam in the training sets
> is a good approximation to the real-life ratio of spam to not-
> spam, it's also dubious to train the system with bogus P(S) and
> P(not-S) values.
Right -- which is why I want to experiment with actually *using* the
real-life running ratio.
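One way that experiment could look, as a rough sketch only: keep a running count of messages classified each way and feed the resulting estimate of P(S) into Bayes' rule instead of an implicit 50/50 split. The class and function names, and the starting pseudocounts of 1, are assumptions made for illustration; none of this is in GBayes.py.

```python
class RunningPrior:
    """Tracks the observed spam / not-spam ratio of incoming mail."""

    def __init__(self, spam_seen=1, ham_seen=1):
        # Pseudocounts of 1 (an assumption) avoid a 0/0 estimate
        # before any mail has been seen.
        self.spam_seen = spam_seen
        self.ham_seen = ham_seen

    def update(self, is_spam):
        """Record one more classified message."""
        if is_spam:
            self.spam_seen += 1
        else:
            self.ham_seen += 1

    def p_spam(self):
        """Current estimate of P(S)."""
        return self.spam_seen / (self.spam_seen + self.ham_seen)


def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Bayes' rule for one word, with an explicit prior P(S)."""
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * (1.0 - p_spam)
    if denominator == 0:
        # The word carries no evidence either way; fall back to the prior.
        return p_spam
    return numerator / denominator


prior = RunningPrior()
for is_spam in (True, True, True, False):
    prior.update(is_spam)

# A word equally likely in spam and not-spam leaves the prior unchanged.
print(posterior_spam(0.1, 0.1, prior.p_spam()))
```

With P(S) fixed at 0.5 the prior cancels out of the ratio, which is the sense in which training on an unrepresentative spam/not-spam mix bakes in bogus P(S) and P(not-S) values.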
> c. I'll get back to this when our testing infrastructure is trustworthy.
> At the moment I'm hosed because the spam corpus I pulled off the
> web turns out to be trivial to recognize in contrast to Barry's
> corpus of good msgs from python.org mailing lists:
Ouch. That's a trap I'll have to watch out for in handling other
people's corpora.
--
<a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>