[Spambayes] RE: For the bold
Rob Hooft
rob@hooft.net
Sun, 06 Oct 2002 07:49:06 +0200
Tim Peters wrote:
> I believe "the bug" is in rmspik.chance(), which appears to assume that a
> zscore in the positive direction is an indicator of certainty. That seems
> to be true in the logarithmic central-limit schemes, but isn't true in the
> original central-limit scheme. Changing the first three lines like so:
>
> # if x>=0:
> # return 1.0
> # x=-x/math.sqrt(2)
> x = abs(x)/math.sqrt(2)
Indeed, the chance function as I wrote it uses the information I had,
which was only based on my clt2 experience, where positive Z-scores mean
"absolute certainty", and negative Z-scores are increasingly uncertain.
But: in practice, even for clt2, positive Z-scores above 2.0 do not
appear very frequently if at all, and if/when that happens, the chance
that the message belongs to the "other" group is extremely small. I just
tried it for my clt2 data: your fix doesn't change anything there.
In case you're wondering what chance(x) does below those if statements:
    if x < 1.4:
        return 1.0
    pre = math.exp(-x**2) / math.sqrt(math.pi) / x
    post = 1.0 - (1.0 / (2.0 * x**2))
    return pre * post
This is an approximation of the integral under the tail of the unit
Gaussian, but the approximation is only valid for x>>1, so for the
"mass" of the curve we just return 1.
Tim: It does look like your messages are a bit easier to classify than
mine....
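For anyone who wants to check the numbers: the pre*post expression above
is the first two terms of the standard large-x expansion
erfc(x) ~ exp(-x**2) / (x*sqrt(pi)) * (1 - 1/(2*x**2) + ...), so it can
be compared directly against math.erfc. What follows is only a sketch,
not the actual rmspik code: it assumes chance() receives a raw z-score,
folds in Tim's abs() change, and uses a helper called exact() that is
purely illustrative. math.erfc needs Python 2.7 or later.

```python
import math


def chance(z):
    """Approximate P(|Z| > |z|) for a unit Gaussian (sketch, see assumptions above)."""
    # Tim's suggested change: treat positive and negative z-scores alike.
    x = abs(z) / math.sqrt(2)
    # The asymptotic series is unreliable near the bulk of the curve,
    # so everything below the cutoff is reported as "no evidence": 1.0.
    if x < 1.4:
        return 1.0
    pre = math.exp(-x**2) / math.sqrt(math.pi) / x
    post = 1.0 - (1.0 / (2.0 * x**2))
    return pre * post


def exact(z):
    """Reference value: two-sided Gaussian tail via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))


# Compare the approximation with the reference at a few z-scores.
for z in (2.0, 3.0, 4.0, 6.0):
    approx, ref = chance(z), exact(z)
    print("z=%.1f  approx=%.6e  erfc=%.6e  rel.err=%.4f"
          % (z, approx, ref, abs(approx - ref) / ref))
```

In this sketch the approximation comes out a few percent low around
z = 3 and closes in on erfc as z grows, which fits the x>>1 remark.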
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/