[Spambayes] RE: For the bold

Rob Hooft rob@hooft.net
Sun, 06 Oct 2002 07:49:06 +0200


Tim Peters wrote:
> I believe "the bug" is in rmspik.chance(), which appears to assume that a
> zscore in the positive direction is an indicator of certainty.  That seems
> to be true in the logarithmic central-limit schemes, but isn't true in the
> original central-limit scheme.  Changing the first three lines like so:
> 
> #    if x>=0:
> #        return 1.0
> #    x=-x/math.sqrt(2)
>     x = abs(x)/math.sqrt(2)

Indeed, the chance function as I wrote it uses the information I had,
which was based only on my clt2 experience, where positive Z-scores mean
"absolute certainty" and negative Z-scores are increasingly uncertain.
But in practice, even for clt2, positive Z-scores above 2.0 appear
rarely, if at all, and when they do, the chance that the message belongs
to the "other" group is extremely small. I just tried it on my clt2
data: your fix doesn't change anything there.

In case you're wondering what chance(x) does below those first lines:

     if x < 1.4:
         return 1.0
     pre = math.exp(-x**2) / math.sqrt(math.pi) / x
     post = 1.0 - (1.0 / (2.0 * x**2))
     return pre * post

This is an approximation of the integral under the tail of the unit
normal (Gaussian) distribution, but the approximation is only valid for
x>>1, so for the "mass" of the curve we just return 1.
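
For anyone who wants to check how good the approximation is, here is a
minimal sketch that puts the pieces together, assuming Tim's abs() change
is applied in front of the lines above. The expression matches the first
two terms of the large-x series for erfc(x), so the sketch compares it
with math.erfc from the standard library. The helper exact_tail is
hypothetical and exists only for this comparison; it is not part of
rmspik.

```python
import math

def chance(z):
    """Approximate Gaussian tail probability for a z-score (sketch)."""
    # Tim's proposed change: treat both directions alike.
    x = abs(z) / math.sqrt(2)
    # The series is only usable for large x; return 1 for the bulk.
    if x < 1.4:
        return 1.0
    pre = math.exp(-x**2) / math.sqrt(math.pi) / x
    post = 1.0 - (1.0 / (2.0 * x**2))
    return pre * post

def exact_tail(z):
    # Hypothetical reference helper: erfc(|z|/sqrt(2)) is the
    # two-sided tail of the unit normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

for z in (2.0, 3.0, 4.0, 6.0):
    approx = chance(z)
    exact = exact_tail(z)
    print("z=%.1f  approx=%.3e  exact=%.3e  ratio=%.4f"
          % (z, approx, exact, approx / exact))
```

The ratio should move toward 1 as z grows, and with the 1.4 cutoff on x
any |z| below about 1.98 simply gets 1.0.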

Tim: It does look like your messages are a bit easier to classify than 
mine....

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/