[spambayes-dev] "Tackling the Poor Assumptions of Naive Bayes Text Classifiers"

Phillip J. Eby pje at telecommunity.com
Wed Jan 14 19:27:58 EST 2004

I ran across this paper today that may be of some interest:


It specifically discusses:

* Classification bias due to unbalanced training data
* Classification bias due to clues that usually or always occur together
* Other errors due to the way word frequencies in text differ from an ideal 
"Bayesian" model

It appears that their training approach calls for manipulating token counts 
of the training documents, adjusting the count of a token 't' in a single 
message, roughly as follows (if my math is correct):

# Adjust token counts for the power-law distribution of terms in normal text
for t in count:
    count[t] = log(count[t] + 1)

# Smooth noise caused by random correlations between frequently occurring
# tokens and a particular classification
for t in count:
    count[t] *= log(totalMessagesTrained / numberOfMessagesContaining[t])

# Adjust for differences in token frequency probability based on size of
# the message (normalize by the message's vector length, computed after
# the adjustments above)
norm = sqrt(sum(x * x for x in count.values()))
for t in count:
    count[t] /= norm

...and then that's about where my math gives out, roughly halfway into their 
training adjustments.  The math definitely calls for new data structures, 
though, since IIUC we currently keep only raw token counts, without a 
separate count of "messages this token was seen in".
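To make the three adjustments concrete, here's a minimal runnable sketch. 
The function and parameter names (transform_counts, doc_freq, 
total_messages) are mine, not spambayes identifiers; the per-message 
document-frequency counter is exactly the extra data structure described 
above that spambayes doesn't currently keep:

```python
from collections import Counter
from math import log, sqrt

def transform_counts(count, doc_freq, total_messages):
    """Apply the paper's three count adjustments to one message.

    count          -- Counter of raw token counts for this message
    doc_freq       -- Counter: number of training messages containing each token
    total_messages -- total number of messages trained
    """
    out = {}
    # 1. Log transform for the power-law distribution of term frequencies
    for t, c in count.items():
        out[t] = log(c + 1)
    # 2. Inverse-document-frequency weighting to damp tokens that occur
    #    in many messages regardless of classification
    for t in out:
        out[t] *= log(total_messages / doc_freq[t])
    # 3. Length normalization so message size doesn't skew the weights
    norm = sqrt(sum(x * x for x in out.values()))
    if norm:
        for t in out:
            out[t] /= norm
    return out
```

After the transform, every message contributes a unit-length vector of 
adjusted counts, regardless of how long it was.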

The next steps involve using the *opposite* classification of a message to 
determine classification weights (i.e. ham and spam weights) for the 
tokens, and normalizing those weights to counteract training-sample-size 
bias.  I don't understand their math well enough to tell whether these 
techniques resemble the "experimental ham/spam imbalance adjustment" idea, 
or whether they're things the chi-square combining already does.
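As I read it, the "opposite classification" step means scoring each class 
by the token counts pooled from every *other* class, then dividing each 
class's weight vector by its total magnitude.  A rough sketch, with 
hypothetical names (complement_weights, docs_by_class) and a smoothing 
parameter alpha that I'm assuming works like the usual Laplace smoothing:

```python
from collections import Counter
from math import log

def complement_weights(docs_by_class, vocab, alpha=1.0):
    """Sketch of complement weighting with weight normalization.

    docs_by_class -- {label: Counter of summed token counts for that label}
    vocab         -- set of all tokens seen in training
    """
    weights = {}
    for c in docs_by_class:
        # Pool token counts from every class *except* c
        comp = Counter()
        for other, counts in docs_by_class.items():
            if other != c:
                comp.update(counts)
        total = sum(comp.values())
        # Smoothed log-probability of each token under the complement
        w = {t: log((comp[t] + alpha) / (total + alpha * len(vocab)))
             for t in vocab}
        # Normalize the weight vector so a class trained on more data
        # doesn't dominate purely through sample size
        z = sum(abs(v) for v in w.values())
        weights[c] = {t: v / z for t, v in w.items()}
    return weights
```

With only two classes (ham/spam), "the complement of ham" is just the spam 
pool and vice versa, which is why the paper's trick maps onto our setting 
so directly.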
