[Spambayes] Chi-square scoring
richard at jowsey.com
Fri Jan 31 10:29:55 EST 2003
Hi again Gary,
I've implemented your prob-combining technique and a chi-squared
function in Java, and have run some very revealing tests. The
first observation I'd make is that *any* measure of "spamminess"
is only as good as the good/junk word databases. So I've done a
fair amount of experimentation on ways to fine-tune my training
corpus, especially wrt the careful quarantining of messages
which are incorrectly classified, or are decidedly "unsure" and
will probably remain so forever... <grin>
Now, with a high-Q database, the probability distributions
(pSpam) for the training corpus very closely approximate two
binomial/normal distributions, with means around 0.25 and 0.75,
and standard deviations of approx 1/12 (0.083), which is exactly
what we'd expect from first principles, n'est-ce pas?
In theory then, the 95%-confidence boundaries of an "unsure"
zone (centered around pSpam=0.5) can be defined as pSpam falling
between the 2-sigma points of the training distributions:
Unsure lower limit: 0.25 + (2 * 1/12) = 0.417
Unsure upper limit: 0.75 - (2 * 1/12) = 0.583
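As a quick sanity check, that 2-sigma arithmetic can be expressed as a tiny Python sketch (the names are mine, not from my Java implementation):

```python
# Hypothetical sketch: derive the "unsure" zone from the two training
# score distributions (means ~0.25 and ~0.75, sigma ~1/12).

HAM_MEAN, SPAM_MEAN = 0.25, 0.75
SIGMA = 1.0 / 12.0  # approx 0.083

def unsure_zone(ham_mean=HAM_MEAN, spam_mean=SPAM_MEAN, sigma=SIGMA, k=2):
    """Return (lower, upper) bounds of the unsure zone, taken as the
    k-sigma points of the ham and spam score distributions."""
    return ham_mean + k * sigma, spam_mean - k * sigma

lower, upper = unsure_zone()
print(round(lower, 3), round(upper, 3))  # 0.417 0.583
```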
In repeated testing, this simple approach provides reliable
classification of randomly-selected streams of incoming email,
viz. ~zero false positives and extremely accurate "uncertains".
For comparison, I've also run the same streams through your chi-
squared test, with (as you suggested) the null hypothesis being
some normal distribution around 0.5, i.e. "I'm absolutely
uncertain about anything". The outcomes are remarkably similar
to my 2-sigma approach, but now the unsure zone is "stretched"
logarithmically between chi-2 scores of ~0.15 and ~0.85. And
yes, the same bunch of messages drop into the spam/unsure/ham
regions, whichever scoring method is used. :-)
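For anyone following along, here is a minimal Python sketch of chi-squared combining in the style described above: Fisher's method run against both the "all-ham" and "all-spam" null hypotheses, with the two tail probabilities folded into a single [0, 1] score (0.5 = completely unsure). This is my reading of the scheme, not the Java code itself, and the function names are illustrative.

```python
import math

def chi2Q(x2, v):
    """Survival function of the chi-squared distribution, valid for
    even degrees of freedom v (closed form via the Erlang series)."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_squared_score(probs):
    """Combine per-word spam probabilities (each strictly in (0, 1))
    with Fisher's method against both null hypotheses, then map the
    pair of tail probabilities onto [0, 1]: 1 = spam, 0 = ham."""
    n = len(probs)
    S = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    H = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

With uniformly uninformative words (all probabilities 0.5) the score lands exactly on 0.5, while consistently spammy or hammy words push it toward the extremes.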
Conclusions? After 1st-pass training, the good/junk word
databases should definitely be re-tuned against the corpus. A
low-Q database will simply "muddy" the classifier, irrespective
of statistical technique. In such a poor signal/noise scenario,
with lots of "unsures" in the corpus and/or in the sample
stream, chi-2 scoring is a definite plus! However, this test is
fairly expensive computationally, so in practice we might only
need to perform chi-2 when a message's raw pSpam falls between,
say, 0.25 and 0.75 (an approach which gives exactly the same
outcomes, but is considerably faster when proxying).
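That two-stage idea can be sketched in a few lines of Python (a hypothetical helper; the threshold values and names are just the ones suggested above):

```python
def score_message(raw_pspam, word_probs, chi2_fn):
    """Two-stage scoring: trust the raw combined probability when it
    is decisive, and fall back to the (computationally expensive)
    chi-squared test only for messages in the contested middle band."""
    if raw_pspam < 0.25 or raw_pspam > 0.75:
        return raw_pspam           # clearly ham or clearly spam
    return chi2_fn(word_probs)     # borderline: run chi-squared
```

The gate keeps proxy-time cost low because the chi-squared pass only ever runs on the minority of messages whose raw score is ambiguous.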
I can post you testing logs depicting these various results if
you're interested.