[Spambayes] Chi-square scoring

Richard Jowsey richard at jowsey.com
Fri Jan 31 10:29:55 EST 2003

Hi again Gary,

I've implemented your prob-combining technique and a chi-squared 
function in Java, and have run some very revealing tests. The 
first observation I'd make is that *any* measure of "spamminess" 
is only as good as the good/junk word databases. So I've done a 
fair amount of experimentation on ways to fine-tune my training 
corpus, especially wrt the careful quarantining of messages 
which are incorrectly classified, or are decidedly "unsure" and 
will probably remain so forever... <grin>

Now, with a high-Q database, the probability distributions 
(pSpam) for the training corpus very closely approximate two 
binomial/normal distributions, with means around 0.25 and 0.75, 
and standard deviations of approx 1/12 (0.083), which is exactly 
what we'd expect from first principles, n'est-ce pas?

In theory then, the 95%-confidence boundaries of an "unsure" 
zone (centered around pSpam=0.5) can be defined as pSpam falling 
between the 2-sigma points of the training distributions:
   Unsure lower limit: 0.25 + (2 * 1/12) = 0.417
   Unsure upper limit: 0.75 - (2 * 1/12) = 0.583
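In Python terms, assuming those fitted means and the 1/12 sigma, the boundary arithmetic is just:

```python
sigma = 1.0 / 12.0                 # fitted std-dev of each training distribution
ham_mean, spam_mean = 0.25, 0.75   # fitted means of the two distributions

unsure_lo = ham_mean + 2 * sigma   # 2-sigma above the ham mean  -> ~0.417
unsure_hi = spam_mean - 2 * sigma  # 2-sigma below the spam mean -> ~0.583

def is_unsure(pspam):
    """True when pSpam falls inside the 95%-confidence unsure zone."""
    return unsure_lo < pspam < unsure_hi
```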

In repeated testing, this simple approach provides reliable 
classification of randomly-selected streams of incoming email, 
viz. ~zero false positives and extremely accurate "uncertains". 
For comparison, I've also run the same streams through your chi-
squared test, with (as you suggested) the null hypothesis being 
some normal distribution around 0.5, i.e. "I'm absolutely 
uncertain about anything". The outcomes are remarkably similar 
to my 2-sigma approach, but now the unsure zone is "stretched" 
logarithmically between chi-2 scores of ~0.15 and ~0.85. And 
yes, the same bunch of messages drop into the spam/unsure/ham 
regions, whichever scoring method is used.  :-)
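For reference, here's a small Python sketch of the Fisher-style chi-squared combining scheme used in Spambayes-like classifiers — a score of 0.5 corresponds to the "absolutely uncertain" null hypothesis (function names are mine, not from any particular codebase):

```python
import math

def chi2Q(x2, v):
    """Probability that a chi-squared variable with v degrees of
    freedom (v even) exceeds x2, via the standard series expansion."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spam probabilities (each strictly between
    0 and 1) into one score in [0, 1]; 0.5 means maximally unsure."""
    n = len(probs)
    # Spam evidence: how surprising are these probs if the message is ham?
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Ham evidence, symmetrically.
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S + (1.0 - H)) / 2.0
```

Note how perfectly balanced evidence lands the score exactly on 0.5, while lopsided evidence pushes it hard toward 0 or 1 — which is what produces the "stretched" unsure zone.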

Conclusions?  After 1st-pass training, the good/junk word 
databases should definitely be re-tuned against the corpus. A 
low-Q database will simply "muddy" the classifier, irrespective 
of statistical technique. In such a poor signal/noise scenario, 
with lots of "unsures" in the corpus and/or in the sample 
stream, chi-2 scoring is a definite plus! However, this test is 
fairly expensive computationally, so in practice we might only 
need to perform chi-2 when a message's raw pSpam falls between, 
say, 0.25 and 0.75 (an approach which gives exactly the same 
outcomes, but is considerably faster when proxying).
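That gating idea can be sketched as follows — the 0.25/0.75 gates and 0.15/0.85 chi-2 cutoffs are the figures above, and `chi_score` stands in for whichever chi-squared combiner is in use:

```python
def classify(raw_pspam, probs, chi_score, lo=0.25, hi=0.75):
    """Only pay for the expensive chi-squared test when the cheap
    raw pSpam score is ambiguous."""
    if raw_pspam <= lo:
        return "ham"
    if raw_pspam >= hi:
        return "spam"
    # Ambiguous zone: fall back to chi-squared scoring.
    score = chi_score(probs)
    if score <= 0.15:
        return "ham"
    if score >= 0.85:
        return "spam"
    return "unsure"
```

Messages with a clearly hammy or spammy raw score never touch the chi-squared path at all, which is where the speedup comes from.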

I can post you testing logs depicting these various results if 
you're interested...

