[spambayes-dev] "Tackling the Poor Assumptions of Naive Bayes Text
Classifiers"
Phillip J. Eby
pje at telecommunity.com
Wed Jan 14 19:27:58 EST 2004
I ran across this paper today; it may be of some interest:
http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf
It specifically discusses:
* Classification bias due to unbalanced training data
* Classification bias due to clues that usually or always occur together
* Other errors due to the way word frequencies in text differ from an ideal
"Bayesian" model
It appears that their training approach calls for manipulating token counts
of the training documents, adjusting the count of a token 't' in a single
message, roughly as follows (if my math is correct):
# Adjust token counts for the power-law distribution of terms in normal text
count[t] = log(count[t] + 1)
# Smooth noise caused by chance correlations between frequently occurring
# tokens and a particular classification (an IDF-style weighting)
count[t] *= log(totalMessagesTrained / numberOfMessagesContaining[t])
# Adjust for differences in token frequency based on the size of the
# message: divide the whole adjusted count vector by its Euclidean norm,
# computed once, after the two steps above
norm = sqrt(sum([x*x for x in count.values()]))
count[t] /= norm
...and then that's about where my math gives out, roughly halfway through
their training adjustments. The math definitely calls for a new data
structure, though: IIUC we only keep raw token counts, without a separate
count of "messages this token was seen in".
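For concreteness, here's a minimal runnable sketch of those three
adjustments over a whole training set, assuming each message is just a
dict of raw token counts. The `transform` and `doc_freq` names are mine,
not spambayes'; the "messages this token was seen in" counter is exactly
the new data structure mentioned above.

```python
from math import log, sqrt
from collections import Counter

def transform(messages):
    """messages: list of {token: raw_count} dicts, one per message."""
    total = len(messages)
    # Document frequency: how many messages each token appears in.
    # This is the extra count we don't currently keep.
    doc_freq = Counter()
    for counts in messages:
        doc_freq.update(counts.keys())

    adjusted = []
    for counts in messages:
        new = {}
        for t, c in counts.items():
            # 1. Dampen the power-law term distribution.
            v = log(c + 1)
            # 2. IDF-style weighting: discount tokens that occur in
            #    many messages regardless of classification.
            v *= log(total / doc_freq[t])
            new[t] = v
        # 3. Length normalization: divide the whole vector by its
        #    Euclidean norm, computed once after steps 1 and 2.
        norm = sqrt(sum(v * v for v in new.values()))
        if norm:
            new = {t: v / norm for t, v in new.items()}
        adjusted.append(new)
    return adjusted
```

Note that a token occurring in every message gets an IDF factor of
log(total/total) = 0, i.e. it's discounted entirely.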
The next steps involve using the *opposite* classification of a message to
determine classification weights (i.e. ham and spam weights) for the
tokens, and normalizing those weights to counteract training-sample-size
bias. I don't understand their math well enough to tell whether their
techniques are similar to the "experimental ham/spam imbalance adjustment"
idea, or whether they're things the chi-square classifier already does.
More information about the spambayes-dev
mailing list