[spambayes-dev] Give up on experimental_ham_spam_imbalance_adjustment?

Tim Peters tim.one at comcast.net
Fri Sep 12 23:36:25 EDT 2003


experimental_ham_spam_imbalance_adjustment has been True in the Outlook
addin, but still False by default everywhere else (AFAIK).

I'd like to ask everyone running the Outlook addin to change it to False in
their default_bayes_customize.ini file, and just live with that for a week,
noting any new peculiarities.

What this option is all about:  We first compute a word's spamprob by
counting how often the word has appeared in ham and spam messages, and then
doing some arithmetic to produce a reasonable ratio between 0 and 1.  Call
that "by-counting" spamprob p.

p can't be used directly.  At the extreme, if a word was seen in one spam
and no ham before, p is exactly 1, and if seen in one ham and no spam
before, exactly 0.  If you feed any single spamprob of 0 or 1 into the
combining math, the end result will be 0 or 1, regardless of what the
other spamprobs feeding into it are.  It's crazy in a statistical scheme to
let one piece of evidence completely determine the outcome (that's what
rule-based schemes are for, and they're brittle).
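To see why a single 0 or 1 pins the result, here's the classic Graham-style product combining formula (a simplified illustration -- SpamBayes actually uses chi-squared combining by default, but it has the same failure mode at the extremes):

```python
def combine(probs):
    """Combine per-word spamprobs via P = prod(p) / (prod(p) + prod(1-p)).

    A simplified Graham-style formula for illustration only; it shows
    how one extreme spamprob completely determines the outcome.
    """
    num = 1.0
    den = 1.0
    for p in probs:
        num *= p
        den *= 1.0 - p
    return num / (num + den)

# A lone extreme spamprob overrides any amount of opposing evidence:
print(combine([0.9, 0.8, 0.7, 0.0]))  # 0.0 -- "certain ham", no matter what
print(combine([0.1, 0.2, 0.3, 1.0]))  # 1.0 -- "certain spam", no matter what
```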

Intuitively, the real problem is that our by-counting guess is *only* a
guess, and has no claim to being absolute truth.  It only reflects what
we've trained on, so it's only reliable to the extent that our training
data perfectly predicts what we're going to see in the future -- but it
doesn't, and there's no way to know in advance how far off from future
reality it is.

Gary Robinson gave us a slick way to deal with this:  instead of using the
by-counting guess p, use a weighted average of p and 0.5
(unknown_word_prob).  The weight given to 0.5 is fixed at 0.45 today
(unknown_word_strength), and the weight given to p is what
experimental_ham_spam_imbalance_adjustment is all about.
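In code, the weighted average looks roughly like this (a sketch using the option names and defaults quoted above, not the actual SpamBayes source):

```python
def adjusted_spamprob(p, n, unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Weighted average of the by-counting guess p (weight n) and the
    neutral prior unknown_word_prob (weight unknown_word_strength)."""
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * p) / (s + n)

# With no evidence at all (n=0), the result is just the neutral prior:
print(adjusted_spamprob(0.0, 0))  # 0.5
```

As n grows, the average is pulled ever closer to p; the whole debate here is about how big n should be.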

When that's False, the weight given to p is the sum of the number of ham and
spam the word appeared in.  When True, and there's much more spam than ham
in the training data, or much more ham than spam, the weight given to p can
be much smaller than the sum of the number of messages the word has appeared
in.  This is *trying* to account for the intuition that when training data
is wildly unbalanced, we have much less reason to be confident about how
reliable a by-counting spamprob guess is.

One concrete example:  suppose we've trained on 30,000 ham and 100 spam, and
the word "fudge" appeared in 100 of those ham and none of those spam.  The
ratio of ham it's appeared in is then 0.003333..., the ratio of spam it's
appeared in is 0, and the by-counting spamprob p is

    0/(0 + 0.003333...) = 0

If we see a new message containing "fudge", how much weight should we give
to that spamprob of 0?  When experimental_ham_spam_imbalance_adjustment is
False, we give it a weight of 100 (the total # of training msgs it's
appeared in), and the weighted-average spamprob is

    0.0022399     option False

When the option is True, we give it a weight of 0.333333333, and the
weighted-average spamprob is then much milder (by a factor of more
than 128!):

    0.287234043   option True

Who knows?  We've trained on so little spam (compared to ham) that they're
both wild-ass guesses.
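The arithmetic above can be reproduced directly (a standalone sketch, not the SpamBayes implementation):

```python
# Training data: 30000 ham, 100 spam; "fudge" seen in 100 ham, 0 spam.
s, x = 0.45, 0.5           # unknown_word_strength, unknown_word_prob
hamratio = 100 / 30000.0   # fraction of training ham containing "fudge"
spamratio = 0 / 100.0      # fraction of training spam containing "fudge"
p = spamratio / (spamratio + hamratio)   # by-counting spamprob: 0.0

def averaged(p, n):
    """Weighted average of p (weight n) and the neutral prior x (weight s)."""
    return (s * x + n * p) / (s + n)

print(round(averaged(p, 100), 7))      # 0.0022399    -- option False
print(round(averaged(p, 1.0 / 3), 9))  # 0.287234043  -- option True
```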

Suppose the world had been a little different:  all the same, except that
"fudge" had appeared in a single training spam.  Then the by-counting
spamprob p would zoom from 0 (certain ham) to 0.75 (probably spam).  That's
an enormous difference for a 1-out-of-30100-messages change, and that alone
is a reason for being suspicious about a spamprob as strong as 0.0022399 in
the slightly different world we started with.  In our new slightly different
world, the straight and adjusted weighted-average guesses are

    0.7488911    option False
    0.686915888  option True

instead.
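Those numbers can be checked the same way.  Note that the True-case weight here is 4/3: working backward from the figures in this message, the adjustment appears to scale the over-represented class's count by the imbalance ratio (spamcount + hamcount * nspam / nham) -- an inference from the numbers, not a quote of the implementation:

```python
# Same training data as before, but "fudge" has now appeared in 1 spam.
s, x = 0.45, 0.5           # unknown_word_strength, unknown_word_prob
hamratio = 100 / 30000.0   # fraction of training ham containing "fudge"
spamratio = 1 / 100.0      # fraction of training spam containing "fudge"
p = spamratio / (spamratio + hamratio)   # 0.75: zoomed up from certain ham

def averaged(p, n):
    """Weighted average of p (weight n) and the neutral prior x (weight s)."""
    return (s * x + n * p) / (s + n)

print(round(averaged(p, 101), 7))            # 0.7488911   -- option False
print(round(averaged(p, 1 + 100.0 / 300), 9))  # 0.686915888 -- option True
```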

There are several things to note:

1. When the option is True, small changes in training data make smaller
   changes in final spamprob guesses.  I happen to think that's good,
   but the data may not agree.

2. The difference between True and False can be gigantic when (a)
   there is wild imbalance; and, (b) a token has never (yet) appeared
   in the class with the smaller amount of training data.  The difference
   between 0.0022399 and 0.287234043 above is huge, the difference
   between a very strong clue and a mild clue.

3. The difference between True and False can't be extreme for a token
   that's appeared in at least one ham and one spam.  The difference
   between 0.7488911 and 0.686915888 above is real but hardly dramatic.


Now in some earlier tests, some people (including me) reported better
results with unbalanced training data when setting the option True.  But the
imbalances I tried were much milder than some of the imbalances reported by
actual Outlook users (which have exceeded the factor of 300(!) in the
30000-ham-plus-100-spam example in this message).

There's also a difference between most of our testing and real life:  in
most testing, a classifier is built and then predicts some hundreds or
thousands of new messages.  That's not how the Outlook client is *used*,
though.  In real life, a relatively small batch of messages comes in, and
then the classifier is quickly trained on mistakes and unsures.

One unfortunate consequence of setting the option True is that adding even
more training data to the class that already has the wildly larger number of
examples has little effect.  The system is already unhappy with the massive
imbalance, and increasing the imbalance just causes it to give even less
weight to the over-represented class.  This may actually be good
for a static classifier that predicts thousands of messages without further
training, but all the evidence I'm seeing from real users with wild
imbalance is that it frustrates them due to the lack of instant
gratification as they futilely increase the imbalance again and again.

So I think this option just doesn't work in real life, but want to be a
little cautious about changing the default (I seem naturally to tend toward
a 2-to-1 imbalance, in favor of spam, in the 3 classifiers I use commonly,
and the imbalance adjustment hasn't seemed to hurt me a bit; I'm trying it
the other way now, and that doesn't seem to be hurting me a bit either).

Let me try to make the core a bit clearer here:  in the example at the
start, the adjustment hates a spamprob as strong as 0.0022399, because as
soon as "fudge" appears in the first spam, a hammy spamprob as strong as
0.0022399 has a very good chance of making the msg classify as Unsure
instead of as Spam.  The adjusted spamprob of 0.287234043 is very much less
likely to cause such a mistake.  That's what the adjustment is *trying* to
accomplish, and it succeeds at that.  OTOH, if ham containing "fudge"
continues to come in and get classed as Unsure, training even more of those
as ham doesn't do much to reduce "fudge"'s adjusted spamprob, and users get
frustrated.  Also, because of the way the addin gets used in real life, as
soon as that first spam containing "fudge" *does* come in and get classed as
Unsure (or even as Ham), the user will train on that instantly, and the
unadjusted spamprob will then instantly increase from the powerful 0.0022399
to the so-so 0.7488911, and "fudge" will never cause a problem again.
