[Spambayes] Perhaps a level header would be useful?
tim.one at comcast.net
Tue Mar 11 20:33:25 EST 2003
> The change I made was to replace line 245 ("prob = (S-H + 1.0) /
> 2.0") of classifier.py with:
>     from math import log
>     if H == 0:
>         H = 0.00000001
>     if S == 0:
>         S = 0.00000001
>     prob = ((-(log(S) - log(H)))/350) + 0.5
Apart from the technical glitches you bumped into, there's a reason we don't
want to combine H and S via any expression of this form. Because the
difference of logs is the log of the quotient, and the negation of a log is
the log of the reciprocal, the heart of this expression is log(H/S), and
it's the H/S part that's undesirable.
If, say, H is 0.99, and S is 0.0099, H/S is 100 and there's no problem with
concluding that we're sure the msg is ham.
But suppose H is .0001 and S is .000001. Then H/S is also 100, but it's
plain nuts to be exactly as sure that the msg is ham: H on its own says the
system thinks there's virtually no chance the msg looks like what it's been
taught about ham, and the low S says the same about what it's been taught
about spam: it doesn't look like either, so Unsure is the "proper"
response. If the system *had* to guess one or the other, then ham is the
best guess it can make, but H on its own says the system doesn't believe
that guess. (Note that in pH calculations, small magnitudes don't "say"
anything significant -- a factor of 100 is equally significant in that domain
no matter how small the input magnitudes.)
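
To make the two examples concrete, here is a small Python sketch that evaluates the quoted expression on both (H, S) pairs. The function name log_ratio_score is made up for illustration; the 350 divisor and the 0.5 offset are taken straight from the quoted change.

```python
from math import log

def log_ratio_score(H, S):
    # Hypothetical helper wrapping the quoted expression.
    # -(log(S) - log(H)) is the same thing as log(H/S).
    return ((-(log(S) - log(H))) / 350) + 0.5

# First example: the msg looks a lot like ham, very little like spam.
first = log_ratio_score(0.99, 0.0099)

# Second example: the msg looks like neither ham nor spam.
second = log_ratio_score(0.0001, 0.000001)

# Both depend only on H/S == 100, so both come out near 0.513.
print(first, second)
```

The two results agree up to floating-point rounding, which is the problem: the expression cannot tell "clearly ham" apart from "looks like nothing I've seen".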
Rob Hooft crafted the simple combining formula we use to give a high
combined score in the first example and a solid Unsure in the second
example. We used a different expression involving a ratio before that, and
examples of the second kind are exactly where it screwed up. Don't want to
do that again <wink>.
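
For comparison, here is a sketch of the expression quoted at the top as the original line 245, (S-H + 1.0) / 2.0, run on the same two examples. It assumes that line is the combining formula meant here. The function name combine is hypothetical, and reading the result as a spam score (near 0 means ham, near 0.5 means Unsure) is also an assumption of this sketch.

```python
def combine(H, S):
    # Hypothetical wrapper around the expression quoted as the
    # original line 245 of classifier.py.
    return (S - H + 1.0) / 2.0

# First example: strong ham evidence, weak spam evidence.
first = combine(0.99, 0.0099)        # about 0.00995, far from the middle

# Second example: almost no evidence either way.
second = combine(0.0001, 0.000001)   # about 0.49995, essentially the middle

print(first, second)
```

Because the formula uses the difference S-H rather than a ratio, two tiny inputs nearly cancel and the result lands near 0.5, however large H/S happens to be.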
BTW, and IIRC, cmp.py never got updated to deal sensibly with unsures. If
that's right, it shouldn't be used except when spam_cutoff == ham_cutoff.
Then you've got a two-outcome classifier (no unsures), and cmp.py won't
"forget" any msgs.
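
A minimal sketch of why equal cutoffs leave no room for unsures. The classify function below is hypothetical, not code from the project, and how a score that lands exactly on a cutoff is handled is an assumption made for illustration.

```python
def classify(prob, ham_cutoff, spam_cutoff):
    # Hypothetical three-way decision on a combined score in [0, 1].
    # Boundary handling (< vs >=) is an assumption of this sketch.
    if prob < ham_cutoff:
        return "ham"
    if prob >= spam_cutoff:
        return "spam"
    return "unsure"

# With distinct cutoffs there is a middle band of unsures.
print(classify(0.5, 0.2, 0.9))

# With spam_cutoff == ham_cutoff the middle band is empty:
# every score is either below the cutoff or at/above it.
outcomes = {classify(p / 100.0, 0.5, 0.5) for p in range(101)}
print(outcomes)
```

In the second case every msg falls into one of two outcomes, so a tool that only counts ham and spam accounts for all of them.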