[spambayes-dev] "Tackling the Poor Assumptions of Naive Bayes Text Classifiers"

Wed Jan 14 21:28:56 EST 2004

[Phillip J. Eby]
> I ran across this paper today, that may be of some interest:
>
> http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf

It is interesting (I've seen it before), but what this project does is so
far removed from a classical NBC (Naive Bayesian Classifier) that it's
unclear whether any of it can apply directly.  Interesting ideas for
research, though.

Just a few comments:

> ...
> # Adjust token counts for the power-law distribution of terms in
> # normal text
> count[t] = log(count[t]+1)

We treat documents as a set of words, not a bag, because testing showed that
treating a message as a set worked better.  It's possible that using a log
gimmick would work better still (that's somewhere between "set" and "bag"),
but it wasn't tried.  (IOW, we're not using what the paper calls a
multinomial model.)
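
To make the three options concrete, here is a minimal sketch of how one
message's tokens would be weighted under each view.  The function names are
made up for illustration and are not taken from the project's code.

```python
import math
from collections import Counter

def bag_counts(tokens):
    """Multinomial ("bag") view: every occurrence of a token counts."""
    return dict(Counter(tokens))

def set_counts(tokens):
    """Set view: a token counts once per message, however often it occurs."""
    return {t: 1 for t in set(tokens)}

def log_counts(tokens):
    """The paper's damping, log(count + 1): the weight still grows with
    repeats, but far more slowly than the raw count does."""
    return {t: math.log(n + 1) for t, n in Counter(tokens).items()}

# Illustrative message: "free" occurs three times.
tokens = "free free free money now".split()
bag = bag_counts(tokens)
as_set = set_counts(tokens)
damped = log_counts(tokens)
```

Under the bag view "free" weighs 3, under the set view 1, and under the log
view log(4), which is why the log transform can be read as a compromise
between the other two.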

> ...
> The math definitely calls for new data structures, though, in that
> IIUC we only keep raw token counts, without a separate count of
> "messages this token was seen in".

Nope, *because* we treat a msg as a set of features, the ham count we store
for a token is equal to the number of distinct messages that contained the
token, and likewise for the spam count.
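
A small sketch of why that follows from set semantics, assuming a
hypothetical training store (the class and attribute names below are
invented for illustration, not the project's data structures): if each
message bumps a token's count at most once, the stored count is already the
number of messages containing that token.

```python
from collections import defaultdict

class TokenCounts:
    """Hypothetical training store; not the project's actual class."""

    def __init__(self):
        self.ham = defaultdict(int)    # token -> number of ham msgs containing it
        self.spam = defaultdict(int)   # token -> number of spam msgs containing it
        self.nham = 0                  # total ham messages trained on
        self.nspam = 0                 # total spam messages trained on

    def learn(self, tokens, is_spam):
        if is_spam:
            self.nspam += 1
            counts = self.spam
        else:
            self.nham += 1
            counts = self.ham
        # set(): a token repeated within one message is counted only once.
        for t in set(tokens):
            counts[t] += 1

db = TokenCounts()
db.learn("free free money".split(), is_spam=True)
db.learn("free lunch".split(), is_spam=True)
db.learn("lunch meeting".split(), is_spam=False)
```

Here "free" appears three times across the spam, but its spam count is 2,
the number of distinct spam messages that contained it.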

> The next steps involved using the *opposite* classification of a
> message to determine classification weights (i.e. ham and spam
> weights) for the tokens, and normalizing the weights in order to
> counteract training sample size bias.  I don't understand their math
> well enough to have any idea if their techniques are similar to the
> "experimental ham/spam imbalance adjustment" idea or not, or are
> things already done by the chi-square classifier.

No, they've got nothing in common, and *this* part wouldn't do us any good
at all.  An NBC is an N-way classifier, and as the paper says of this part:

    In contrast, CNB estimates parameters using data from all
    classes *except* c.  We think CNB's estimates will be more
    effective because each uses a more even amount of training
    data per class, which will lessen the bias in the weight
    estimates.

If N is substantially larger than 2, then basing the computations for a
specific class on N-1 of the classes instead of on just one should indeed be
"more even".  But when N=2, as it is in this project, there's no
difference -- it would just swap the roles of the two classes (IOW, N-1=1
when N=2, and that's all the math you need for this one <wink>).
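
The N-1=1 point can be checked mechanically.  This is only a sketch of the
"all classes except c" pooling step, with an invented helper name and toy
counts; it is not the paper's full CNB weight computation.

```python
from collections import Counter

def complement_counts(per_class, c):
    """Pool token counts from every class except c (the CNB idea)."""
    pooled = Counter()
    for name, counts in per_class.items():
        if name != c:
            pooled.update(counts)   # Counter.update adds counts together
    return pooled

# Two-class case, as in this project (toy numbers).
two = {
    "ham": Counter({"lunch": 4, "meeting": 3}),
    "spam": Counter({"free": 5, "money": 2}),
}

# Four-class case, where pooling really does even things out.
four = {
    "a": Counter({"x": 1}),
    "b": Counter({"x": 2, "y": 1}),
    "c": Counter({"y": 3}),
    "d": Counter({"z": 4}),
}

ham_complement = complement_counts(two, "ham")
a_complement = complement_counts(four, "a")
```

With two classes, the complement of ham is exactly the spam counts, so the
roles simply swap; with four classes, the complement of "a" pools three
classes' worth of data.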

One final point is that an NBC formally assumes statistical independence of
the features it's scoring.  The chi-squared combining method doesn't -- in
fact, the thrust of our combining method is to *exploit* correlation, and
our Unsure category is really the result of failing to find more correlation
in one direction than in the other.  That's not to say that correlation
can't hurt us too, but it appears to help us far more often than it hurts
us, and our combining method is based on detecting deviation from
independence regardless.
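
For readers who haven't seen chi-squared combining, here is a minimal sketch
of a Fisher-style version.  The post doesn't spell out the formula, so the
function names, the (S - H + 1)/2 scoring, and the sample probabilities are
assumptions made for illustration rather than the project's actual code.

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Guard against roundoff pushing the sum slightly above 1.
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities (each strictly in (0, 1)).

    If the tokens were independent and uninformative, -2*sum(ln p) would
    follow a chi-squared distribution with 2n degrees of freedom.  S and H
    measure how far the evidence deviates from that in each direction.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0

spammy = combine([0.99] * 10)               # strong evidence one way
hammy = combine([0.01] * 10)                # strong evidence the other way
mixed = combine([0.99] * 5 + [0.01] * 5)    # strong evidence both ways
```

In this sketch the mixed case lands near 0.5: both S and H are close to 1,
so neither direction wins, which is the situation the Unsure category is
meant to capture.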