[Spambayes] Sequential Test Results

Tim Peters tim.one@comcast.net
Sun, 06 Oct 2002 01:00:56 -0400


[Jim Bublitz]
> ...
> Playing around with Spambayes, I get slightly better results if I
> ...
> c) drop robinson_probability_s to .05,

That's a very low value.  I find this way of rewriting Gary's adjustment
easier to reason about:

    s*x + n*p          x - p
    --------- =  p +  -------
      s + n           1 + n/s

This makes it clear that it moves p in the direction of x, but less so the
larger n is, or the smaller s is.  For you, s=.05, and then that's

          x-p
    p +  ------
         1+20*n

At n=1, that's p + (x-p)/21.  The *interesting* <wink> thing there is that,
since you said you effectively removed Graham's mincount gimmick, under pure
Graham you *were* getting extreme spamprobs of 0.01 and 0.99 for words that
had been seen only once in the training data.  Setting s to 0.05 gives a
very similar effect under Gary's adjustment.  If x is 0.5, a word seen just
once has p=0 (ham only) or p=1 (spam only), and those become

    0 + .5/21 ~= 0.024

and

    1 - .5/21 ~= 0.976

Those are really extreme probability estimates based on 1 measly occurrence
in training data, but perhaps this ties in to the unusual nature of your
data.  For example, I've seen that low s helps ham message threads when a
typo or unusual word gets repeated in replies.
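
For anyone who wants to poke at the arithmetic, here is a small Python sketch
of both forms of the adjustment.  The function names are made up for
illustration and are not taken from the Spambayes code; x=0.5 and s=0.05 are
the values discussed above, and the list of n values is just an arbitrary
choice to show the pull toward x fading as n grows.

```python
def adjusted(p, n, s, x=0.5):
    """Gary's adjustment in its original form: (s*x + n*p) / (s + n)."""
    return (s * x + n * p) / (s + n)


def adjusted_rewritten(p, n, s, x=0.5):
    """The same quantity rewritten as p + (x - p) / (1 + n/s)."""
    return p + (x - p) / (1 + n / s)


if __name__ == "__main__":
    s = 0.05
    # p=0: word seen only in ham; p=1: word seen only in spam.
    for p in (0.0, 1.0):
        for n in (1, 5, 25):  # hypothetical counts, for illustration only
            a = adjusted(p, n, s)
            b = adjusted_rewritten(p, n, s)
            print(f"p={p:.0f} n={n:2d}  original={a:.3f}  rewritten={b:.3f}")
```

At n=1 both forms should print roughly 0.024 for p=0 and 0.976 for p=1,
matching the hand calculation, and the results drift closer to p as n goes up.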