[Spambayes] Getting rid of max_spamprob and min_spamprob

Sun, 15 Sep 2002 20:06:56 -0400

[Neil Schemenauer]
> I don't like the max_spamprob and min_spamprob limits.

They seem mostly to prevent .spamprobs of 0.0 or 1.0 from getting created
when a word appears in only one corpus.  Any msg with a spamprob 0.0 word
(but not also a spamprob 1.0 word) will get rated as 0; any with a spamprob
1.0 word (but not also a spamprob 0.0 word) will get rated as 1; and, if a
msg has both kinds, ZeroDivisionError will occur.  All of that is what
*should* happen with probabilities of 0 and 1, but no finite amount of
training data justifies certainty.
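
To make the three cases concrete, here is a minimal sketch.  It assumes the
Graham-style combining rule P/(P+Q), where P is the product of the word
probabilities and Q is the product of their complements; combine() is a
name made up for this illustration, not the project's spamprob().

```python
def combine(probs):
    """Graham-style combining (assumed here): P / (P + Q)."""
    prod_spam = 1.0   # P: product of the word spamprobs
    prod_ham = 1.0    # Q: product of (1 - spamprob)
    for p in probs:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

# A single 0.0 word forces the score to 0, whatever else is present.
print(combine([0.0, 0.9, 0.8]))

# A single 1.0 word forces the score to 1.
print(combine([1.0, 0.1, 0.2]))

# Both kinds together make P and Q both zero.
try:
    combine([0.0, 1.0, 0.5])
except ZeroDivisionError as exc:
    print("ZeroDivisionError:", exc)
```

Clamping each word probability away from 0.0 and 1.0 keeps both products
nonzero, which is all the limits buy in this sketch.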

> I've written a version of spamprob() that uses long integers,

Very long, in fact <wink>.

> does not clamp the probabilities and uses all evidence.

Not quite:  the number of times a word appears in a single msg feeds into
spamcount and hamcount, but not into spamprob(), where the Set() collapses
repeats.  Note that this also means hamcount/nham can be larger than 1
(similarly for spam); I'm not sure whether you think you've taken that into
account.
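
A toy illustration of that mismatch, assuming training counts every
occurrence of a word while scoring looks only at the set of distinct words.
The helper train_counts() and the sample data are invented for this sketch.

```python
def train_counts(messages):
    """Count every occurrence of each word across all messages."""
    counts = {}
    for words in messages:
        for w in words:              # repeats within one msg all count
            counts[w] = counts.get(w, 0) + 1
    return counts, len(messages)

# Two ham messages; "python" appears four times in total.
ham = [["python", "python", "python"], ["python", "list"]]
hamcount, nham = train_counts(ham)

ratio = hamcount["python"] / nham    # 4 occurrences / 2 messages
print(ratio)                         # larger than 1, so not a probability

# At scoring time a set sees each word at most once per msg.
scoring_words = set(ham[0])
print(scoring_words)
```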

It would be very helpful if you could try explaining in English what you
think you're computing here, as parts of the code don't make sense to me.
For example,

>         real, frac = divmod(numerator, denominator)
>         huge = 1L<<30
>         if real > 0:
>             if real > huge:
>                 prob = 0.0
>             else:
>                 prob = 1.0 / (real + 1.0)
>         else:
>             if frac > huge:
>                 prob = 1.0
>             else:
>                 prob = frac / (1.0 + frac)
>         if evidence:

Suppose numerator = denominator-1:  they're as close as possible without
being equal.  Then divmod() returns real=0 and frac=numerator.  If
numerator > 2**30, this comes out with prob 1.0.  I'm unclear on what you
intended to compute, but it doesn't seem right that nearly equal numerator
and denominator could lead to certainty that a msg is spam.  Perhaps this is
part of why your f-n rate is 0.
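
The case can be checked directly.  The sketch below wraps the quoted
branches in a function and translates them to Python 3 (1L<<30 becomes
1 << 30).  The wrapper name prob_from_ratio() is made up, and whatever
follows "if evidence:" in the original is left out because it wasn't quoted.

```python
def prob_from_ratio(numerator, denominator):
    """The quoted divmod branches, translated to Python 3 unchanged."""
    real, frac = divmod(numerator, denominator)
    huge = 1 << 30
    if real > 0:
        if real > huge:
            prob = 0.0
        else:
            prob = 1.0 / (real + 1.0)
    else:
        if frac > huge:
            prob = 1.0
        else:
            prob = frac / (1.0 + frac)
    return prob

denominator = 1 << 40

# Exactly equal:  real=1, frac=0, so prob is 1/(1+1).
print(prob_from_ratio(denominator, denominator))       # 0.5

# One less:  real=0, frac=numerator > 2**30, so prob jumps to 1.0.
print(prob_from_ratio(denominator - 1, denominator))   # 1.0
```

So a change of 1 in a numerator around 2**40 moves the result from 0.5 to
1.0.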

> The results are interesting, IMHO.  First the rate summary:
> ...
> Perhaps there is some way we can swap the two rates by introducing some
> bias.

Here we go again <wink>.  It also suggests a way to get "a middle ground"
for Greg:  use two scoring schemes, one that has no false negatives and one
that has no false positives.  If a given msg gets conflicting predictions
from these two extremes, Greg reviews it by eyeball.
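
A minimal sketch of that triage idea.  The function triage() and the two toy
scorers are hypothetical stand-ins; real scorers would be whatever schemes
turn out to have no false negatives and no false positives on the test data.

```python
def triage(msg, no_fn_scorer, no_fp_scorer):
    """Return 'spam' or 'ham' when both scorers agree, else 'review'."""
    a = no_fn_scorer(msg)    # True means "spam"; errs toward calling spam
    b = no_fp_scorer(msg)    # True means "spam"; errs toward calling ham
    if a == b:
        return "spam" if a else "ham"
    return "review"          # conflicting predictions:  a human looks

# Toy scorers on a list of words, for illustration only.
def aggressive(msg):
    return "free" in msg or "money" in msg

def conservative(msg):
    return "free" in msg and "money" in msg

print(triage(["free", "money", "now"], aggressive, conservative))
print(triage(["python", "list"], aggressive, conservative))
print(triage(["free", "software"], aggressive, conservative))
```

Only the third msg, where the two extremes disagree, needs review by
eyeball.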