[Spambayes] chi-z combining: a worthless scheme
Tim Peters
tim.one@comcast.net
Wed Oct 16 22:29:54 2002
If one other person thinks this is funny too, it was worth it <wink>.
Since the sum of squares of n unit-normal distributed vars follows a
chi-squared distribution with n degrees of freedom, here's Yet Another test
for rejecting the hypothesis that a vector of probs is uniformly
distributed:
    S = 0.0
    for p in ps:
        z = normIP(p)
        S += z*z
    S = chi2Q(S, len(ps))
This works as it should: S is uniformly distributed when the input ps are
uniformly distributed. But it combines the advantage of being equally
sensitive to high-spamprob and low-spamprob words, with a remarkable
disadvantage no other scheme to date has managed to achieve: it gives very
low scores to ham *and* to spam, and very high scores to exceedingly bland
msgs. Take that, BlandAssassin.
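
For anyone who wants to watch it misbehave, here is a minimal self-contained sketch of the scheme. It rests on assumptions not stated in this post: norm_ip stands in for normIP and is taken to be the inverse CDF of the unit normal (borrowed from the standard library's statistics.NormalDist), and chi2q stands in for chi2Q, written here as the closed-form upper-tail series that is only valid for an even number of degrees of freedom. The names and the even-length restriction are illustrative, not the project's actual code.

```python
import math
from statistics import NormalDist


def norm_ip(p):
    """Assumed stand-in for normIP: z such that P(Z <= z) = p, 0 < p < 1."""
    return NormalDist().inv_cdf(p)


def chi2q(x2, v):
    """Assumed stand-in for chi2Q: P(chi-squared with v d.o.f. >= x2).

    Uses the finite series exp(-m) * sum(m**i / i!), i < v/2, with m = x2/2,
    which holds only when v is even.
    """
    assert v % 2 == 0, "this sketch handles even degrees of freedom only"
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def chi_z(ps):
    """Sum the squared z-scores of the probs, then take the chi-squared tail."""
    s = 0.0
    for p in ps:
        z = norm_ip(p)
        s += z * z
    return chi2q(s, len(ps))


if __name__ == "__main__":
    print("spammy:", chi_z([0.99] * 4))
    print("hammy: ", chi_z([0.01] * 4))
    print("bland: ", chi_z([0.5] * 4))
```

The squaring is what does the damage: a prob of 0.99 and a prob of 0.01 map to z values of equal size and opposite sign, so both push S up and the final score down, while probs near 0.5 map to z near 0 and leave the score near 1.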