[Spambayes] statistical comparison of environment?

Thu Mar 6 01:01:12 EST 2003

[Skip Montanaro]
> ...
> Suppose improvement A takes that to 83% and applied independently to the
> base system, improvement B takes that to 85%.  How do you tell how
> independent A and B are from one another?

It's a well studied area, and any std work on experimental design will cover
it.  Picture an analogy:  spam == disease, and various kinds of clues are
various drugs claimed to cure the disease (or test procedures claimed to
identify the disease).  A proper experimental design can quantify which
drugs work and how well, which combinations are better than the sum of their
parts, and which worse.  This is a messy combinatorial problem, though, and
real-life experiments rarely try to tackle more than a few drugs at a time.
Then again, despite the howling of the perturbed, few people actually die
from a spam that leaks thru <wink>.
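
As a rough illustration of what such a design measures, here is a minimal
sketch of the interaction term in a 2x2 factorial experiment (base, A only,
B only, A and B together).  The base rate of 80% and both "A and B together"
figures are made up for the example; only 83% and 85% come from the question.
It also assumes effects add on the raw percentage scale, and says nothing
about whether a difference is statistically significant.

```python
def interaction(base, a_only, b_only, both):
    """Interaction effect in a 2x2 factorial design (values in percent).

    Zero means A and B combine additively; negative means they overlap
    (worse than the sum of their parts); positive means they reinforce.
    """
    effect_a = a_only - base
    effect_b = b_only - base
    additive_prediction = base + effect_a + effect_b
    return both - additive_prediction


# Hypothetical numbers: base 80%, A alone 83%, B alone 85%.
independent = interaction(80.0, 83.0, 85.0, 88.0)   # A+B measured at 88%
overlapping = interaction(80.0, 83.0, 85.0, 85.5)   # A+B measured at 85.5%
print(independent, overlapping)
```

The catch is that you have to actually run the fourth cell (A and B
together), and with k improvements there are 2**k cells, which is where the
combinatorial mess comes from.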

If I had time, I'd rather investigate Adaboost (mentioned several times here
long ago) as a means to combine various kinds of clues as if they were each
classifiers on their own.  Adaboost is a general approach to combining
multiple classifiers so that the combined classifier is better than any of
its parts, provided only (roughly speaking) that each classifier going into
it does better than chance.  For example, we've seen here that a header-only
classifier can do very well, and so can a classifier that looks only at msg
bodies.  The *best* way to combine those two may very well not be simply
lumping them together as equals.  I ran experiments on a classifier that
looked only at Subject lines, and reported here that it had error rates down
around 5% all by itself.  Etc:  there are lots of little classifiers you
*could* build out of our code base.
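
To make the idea concrete, here is a minimal sketch of discrete AdaBoost
over a fixed pool of already-trained classifiers.  It is not spambayes code:
the classifier names ("header", "body", "subject"), the +1/-1 vote encoding
(+1 == spam), and the six-message toy data are all hypothetical, chosen so
that each classifier is wrong somewhere but the weighted vote is not.

```python
import math


def adaboost(predictions, labels, rounds):
    """Pick and weight classifiers from a fixed pool.

    predictions: dict mapping classifier name -> list of +1/-1 votes
    labels:      list of true +1/-1 labels, same length as each vote list
    Returns a list of (name, alpha) pairs, one per boosting round.
    """
    n = len(labels)
    weights = [1.0 / n] * n          # start with every message equally important
    chosen = []

    def weighted_error(name):
        # Total weight of the messages this classifier gets wrong.
        return sum(w for w, p, y in zip(weights, predictions[name], labels)
                   if p != y)

    for _ in range(rounds):
        best = min(predictions, key=weighted_error)
        err = weighted_error(best)
        if err >= 0.5:               # nothing left that beats chance
            break
        err = max(err, 1e-10)        # avoid log(0) for a perfect classifier
        alpha = 0.5 * math.log((1.0 - err) / err)
        chosen.append((best, alpha))
        # Raise the weight of messages 'best' got wrong, lower the rest.
        weights = [w * math.exp(-alpha * p * y)
                   for w, p, y in zip(weights, predictions[best], labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return chosen


def combined_vote(chosen, votes):
    """votes: dict name -> +1/-1 for one message.  Returns +1 or -1."""
    score = sum(alpha * votes[name] for name, alpha in chosen)
    return 1 if score >= 0 else -1


# Hypothetical toy data: three spam (+1) and three ham (-1) messages.
labels = [1, 1, 1, -1, -1, -1]
predictions = {
    "header":  [-1, 1, 1, -1, -1, -1],   # wrong on message 0
    "body":    [1, 1, 1, 1, -1, -1],     # wrong on message 3
    "subject": [1, -1, 1, -1, 1, -1],    # wrong on messages 1 and 4
}
chosen = adaboost(predictions, labels, rounds=3)
results = [combined_vote(chosen, {name: predictions[name][i] for name in predictions})
           for i in range(len(labels))]
print(chosen)
print(results == labels)
```

Note that the alphas come out of the data:  a classifier that is right on
the messages the others get wrong earns a bigger say, whatever its
stand-alone error rate.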

Chi-combining gives each kind of clue (token) equal weight, and there's no
reason to believe that's optimal.  Gary Robinson once suggested a variant on
the geometric-mean approaches that weighted tokens differently by giving
each an exponent derived from its spamprob (instead of giving each one
exponent 1/n, where n is the # of tokens).  I couldn't make time to pursue
that then.  In a sense, Adaboost is a way of weighting a collection of
classifiers where the data *tells* you good weights to use, instead of
dreaming up an a priori weighting scheme.  Lots of "learning" algorithms do
a similar thing, but Adaboost enjoys a long list of provably good
performance and convergence properties.
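
For comparison, here is a minimal sketch of what "weighting tokens
differently" means for a geometric mean.  With equal weights every token
gets exponent 1/n; with unequal weights each token gets its own exponent.
The particular weighting function below (distance of the spamprob from 0.5)
is a made-up stand-in, not Gary Robinson's actual proposal, whose formula is
not given here, and none of this is chi-combining itself.

```python
import math


def weighted_geometric_mean(probs, weights):
    """Geometric mean of probs where token i gets exponent w_i / sum(w)."""
    total = sum(weights)
    log_sum = sum(w * math.log(p) for p, w in zip(probs, weights))
    return math.exp(log_sum / total)


def equal_weights(probs):
    # Every token counts the same: exponent 1/n each.
    return [1.0] * len(probs)


def extremeness_weights(probs):
    # Hypothetical a priori scheme: tokens far from 0.5 count for more.
    # The small constant keeps neutral tokens from getting zero weight.
    return [abs(p - 0.5) + 0.01 for p in probs]


# One strong spam clue and two neutral tokens (made-up spamprobs).
probs = [0.99, 0.5, 0.5]
plain = weighted_geometric_mean(probs, equal_weights(probs))
skewed = weighted_geometric_mean(probs, extremeness_weights(probs))
print(plain, skewed)
```

A scheme like extremeness_weights is exactly the kind of a priori guess
that a boosting-style approach would replace with weights fitted to the
training data.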

OTOH, if you come up with a better scheme, my original 35K collection of
test msgs can't demonstrate it (spambayes already does a
perfect-as-it-can-be job on it).  OTOH, lots of marginal decisions were
based on that specific collection, and I'm sure some of them would have been
decided differently if anyone else had spent 20 hours a day for two months
dreaming up tests on their test corpus <wink>.