Tim Peters:
a. There are other fudges in the code that may rely on this fudge to cancel out, intentionally or unintentionally. I'm loath to type more about this instead of working on the code, because I've already typed about it. See a later msg for a concrete example of how the factor-of-2 "good count" bias acts in part to counter the distortion here. Take one away, and the other(s) may well become "a problem".
I was thinking of shooting that "goodness bias" through the head and seeing what happens, actually. I've been unhappy with that fudge in Paul's original formula from the beginning.
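For readers who haven't seen it, the "goodness bias" in question is the factor-of-2 doubling of ham counts in Paul Graham's published per-token formula from "A Plan for Spam". A minimal Python sketch of that scoring rule (the function name and parameters here are illustrative, not the project's actual code):

```python
def token_spamprob(bad_count, good_count, nbad, ngood):
    """Per-token spam probability in the style of Paul Graham's
    'A Plan for Spam' formula.

    bad_count / good_count: occurrences of the token in spam / ham.
    nbad / ngood: number of spam / ham messages trained on.
    Returns None when the token is too rare to trust.
    """
    # The fudge under discussion: ham occurrences count double,
    # deliberately tilting scores toward "not spam".
    g = 2 * good_count
    b = bad_count
    if g + b < 5:
        return None  # too little evidence for this token
    # Per-corpus frequencies, capped at 1.0 as in Graham's formula.
    p = min(1.0, b / nbad) / (min(1.0, g / ngood) + min(1.0, b / nbad))
    # Clamp to avoid certainty at either extreme.
    return max(0.01, min(0.99, p))
```

Note the effect of the doubling: a token seen equally often in spam and ham (with equal-sized corpora) scores 1/3 rather than the "neutral" 1/2, which is exactly the kind of hidden counterweight that might be compensating for distortions elsewhere.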
b. Unless the proportion of spam to not-spam in the training sets is a good approximation to the real-life ratio of spam to not-spam, it's also dubious to train the system with bogus P(S) and P(not-S) values.
Right -- which is why I want to experiment with actually *using* the real life running ratio.
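A sketch of what feeding the observed running ratio in as the prior might look like. This is hypothetical code, not the project's actual combining rule: it folds the prior odds P(S)/P(not-S) into a log-odds sum over per-token probabilities, rather than using Paul's original combining formula:

```python
import math

def spam_posterior(token_probs, n_spam_trained, n_ham_trained):
    """Combine per-token spam probabilities with an explicit prior
    taken from the running spam:ham ratio of the training data,
    instead of implicitly assuming P(S) = P(not-S) = 0.5.

    Works in log-odds so long products don't underflow.
    """
    # Prior odds from the observed training-set ratio.
    prior_odds = n_spam_trained / n_ham_trained
    log_odds = math.log(prior_odds)
    # Each token contributes its likelihood ratio p / (1 - p).
    for p in token_probs:
        log_odds += math.log(p) - math.log(1.0 - p)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)  # back to a probability
```

With a balanced prior and neutral tokens this returns 0.5; train on four times as much spam as ham and the same neutral evidence yields 0.8, which is the effect of using the real-life ratio rather than a bogus 50/50 assumption.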
c. I'll get back to this when our testing infrastructure is trustworthy. At the moment I'm hosed because the spam corpus I pulled off the web turns out to be trivial to recognize in contrast to Barry's corpus of good msgs from python.org mailing lists:
Ouch. That's a trap I'll have to watch out for in handling other people's corpora. -- <a href="http://www.tuxedo.org/~esr/">Eric S. Raymond</a>