[Spambayes] chi-square

Thu, 10 Oct 2002 08:45:53 -0400

> 
>> It wouldn't be invoking that optimality theorem, but whatever works...
> 
> I'm not sure the optimality theorem in question is relevant to the task at
> hand, though.  Why should we care about rejecting a hypothesis that the word
> probabilities are uniformly distributed?  There's virtually no message in
> which they are, and no reason to believe that the *majority* of words in
> spam will have spamprobs over 0.5.  Graham got results as good as he did
> because the spamprob strength of a mere handful of words is usually enough
> to decide it.  In a sense, I am trying to move back toward what worked best
> in his formulation.

Right, I agree and I've noted earlier that because the variables aren't
independent this isn't really an "optimal" use of the optimality theorem. ;)
Nevertheless, I think it is a good idea to come as close as we can to
invoking it, because even approximately invoking such a theorem is often
better than doing something that has no real mathematics underlying it at
all.

> There's a dramatic difference in the Paul results, while the Gary results
> move subtly (in comparison).
> 
> If we force 10 additional .99 spamprobs, the differences are night and day:
> 
> Result for random vectors of 50 probs, + 10 forced to 0.99
> 

[Histogram here]

> 
> It's hard to know what to make of this, especially in light of the claim
> that Gary-combining has been proven to be the most sensitive possible test
> for rejecting the hypothesis that a collection of probs is uniformly
> distributed.  At least in this test, Paul-combining seemed far more
> sensitive (even when the data is random <wink>).

If you do the chi-square transformation, it should respond strongly to this
experiment, because it assigns a probability to exactly that kind of
distortion.

That is, doing the inverse chi-square computation uncovers the probabilistic
information that is now completely buried in the product of the p's. That
information can only emerge when the number of p's is taken into account, and
the inverse chi-square computation is what does that. The number of p's is
currently ignored; once it is considered, a very different result will emerge.
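
To make that concrete, here is a minimal Python sketch of the calculation I
have in mind. It is only an illustration, not code from the project: the names
chi2q and fisher_tail are made up for this message, and it assumes the p's are
independent and uniform under the null hypothesis, so that -2 * sum(ln p)
follows a chi-square distribution with 2n degrees of freedom.

```python
import math

def chi2q(x2, v):
    """Return P(chi-square with v degrees of freedom >= x2); v must be even."""
    if v <= 0 or v % 2:
        raise ValueError("v must be a positive even integer")
    m = x2 / 2.0
    # For even v the tail has a closed form: exp(-m) * sum(m**i / i!) for i < v/2.
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1

def fisher_tail(probs):
    """Chance of a product of p's this small or smaller, given len(probs) p's."""
    n = len(probs)
    x2 = -2.0 * sum(math.log(p) for p in probs)
    return chi2q(x2, 2 * n)  # n enters here, as the degrees of freedom

# Same geometric mean (0.2), different numbers of p's.
few = fisher_tail([0.2] * 2)
many = fisher_tail([0.2] * 20)
```

With two p's the result is about 0.17; with twenty p's at the same geometric
mean it drops well under 0.05. The geometric mean alone cannot tell those two
cases apart. To test for p's crowded near 1 instead of near 0, you would feed
in 1-p for each p.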

Look at it this way. You're saying that in your experiment 17% of the p's
are artificially forced to .99. If there are 6 p's to start with, 17% would
only mean 1 p was skewed and that is not very unusual. But if you had
1,000,000 p's, and 17% of them were totally out-of-whack with a uniform
distribution, the odds against it happening by chance alone would be
completely astronomical.
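
A rough way to put numbers on that, as a sketch under my own assumptions: treat
each p as independent and uniform, so the chance that any one p lands at .99 or
above is .01, and ask how likely it is that exactly k of n p's do so. The
function name log10_binom_pmf is made up for this message, and the result is
kept in log10 form because the large-n values underflow ordinary floats.

```python
import math

def log10_binom_pmf(n, k, q):
    """log10 of P(exactly k of n independent p's fall in a region of probability q)."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_p = log_comb + k * math.log(q) + (n - k) * math.log1p(-q)
    return log_p / math.log(10)

q = 0.01  # chance that a uniform p is >= 0.99
for n, k in [(6, 1), (60, 10), (1_000_000, 170_000)]:
    # Roughly 17% of the p's skewed in each case.
    print(n, k, log10_binom_pmf(n, k, q))
```

For 1 of 6 this comes out to a probability of roughly 0.057, which is not
remarkable. For 10 of 60 the log10 is already far below that, and for 170,000
of 1,000,000 it is a negative number with six digits.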

So, you have to figure in the number of p's if you want to get anything like
a real probability.

You can compute that real probability using the inverse chi-square calc.
Otherwise all the probabilistic detail is lost; it just gets buried in the
process of calculating the geometric mean.

If you are playing with different cutoffs, the details that are lost when
you don't do the inverse chi-square calc may really matter. They DON'T
matter if you are only using a .5 cutoff, because the monotonic property
we've discussed means that a binary choice based on a .5 cutoff will be the
same either way. But the details will matter more as you get away from .5
for the cutoff.

--Gary

-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454