[Spambayes] chi-z combining: a worthless scheme

Tim Peters tim.one@comcast.net
Wed Oct 16 22:29:54 2002


If one other person thinks this is funny too, it was worth it <wink>.

Since the sum of squares of n independent unit-normal variables follows a
chi-squared distribution with n degrees of freedom, here's Yet Another test
for rejecting the hypothesis that a vector of probs is uniformly
distributed:

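        # normIP(p): z such that the unit-normal area left of z is p.
        # chi2Q(x2, n): prob that chi-squared with n d.f. is >= x2.
        S = 0.0
        for p in ps:
            z = normIP(p)    # unit-normal when p is uniform on (0, 1)
            S += z*z
        S = chi2Q(S, len(ps))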

This works as it should:  S is uniformly distributed when the input ps are
uniformly distributed.  But it combines the advantage of being equally
sensitive to high-spamprob and low-spamprob words, with a remarkable
disadvantage no other scheme to date has managed to achieve:  it gives very
low scores to ham *and* to spam, and very high scores to exceedingly bland
msgs.  Take that, BlandAssassin.
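
For anyone who wants to watch it misbehave, here is a minimal standalone sketch of the scheme above. It is not the project's code: norm_ip, chi2_q and chi_z_score are hypothetical stand-ins for normIP, chi2Q and the loop, norm_ip leans on the standard library's NormalDist, and chi2_q uses the closed form that holds only for an even number of degrees of freedom, so the sketch assumes an even number of probs, each strictly between 0 and 1.

```python
import math
from statistics import NormalDist


def norm_ip(p):
    """Return z such that the unit-normal area to the left of z is p.

    Stand-in for normIP. Requires 0 < p < 1; inv_cdf raises otherwise.
    """
    return NormalDist().inv_cdf(p)


def chi2_q(x2, v):
    """Return P(chi-squared with v d.f. >= x2). Stand-in for chi2Q.

    Assumption for this sketch: v is positive and even, so the tail is
    exp(-m) * sum(m**i / i! for i < v/2) with m = x2/2.
    """
    if v <= 0 or v % 2:
        raise ValueError("this sketch handles only positive even v")
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def chi_z_score(ps):
    """Combine probs by summing squared z-scores, then taking the tail."""
    s = sum(norm_ip(p) ** 2 for p in ps)
    return chi2_q(s, len(ps))


spammy = chi_z_score([0.99] * 10)  # every word screams spam
hammy = chi_z_score([0.01] * 10)   # every word screams ham
bland = chi_z_score([0.5] * 10)    # every word shrugs
print(spammy, hammy, bland)
```

Squaring z throws away its sign, so the all-spam and all-ham vectors land on the same tiny score, while the all-0.5 vector gives S = 0 and a score of essentially 1.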