[Spambayes-checkins] RE: [Spambayes] On counting words more than once

Tim Peters tim.one@comcast.net
Sun, 29 Sep 2002 14:08:05 -0400


SF still isn't able to mail checkin notifications.

Because Neil, Guido and I all reported improvement via counting duplicate
words (within a message) only once during training, I removed the recent
option for trying this, and we do this all the time now.  The checkin
comment is below.  Note that you may need to change spam_cutoff!

"""
Removed option count_duplicates_only_once_in_training:  this is always
done now.  Counting duplicate words in a msg more than once during
training appears to have been helpful under the Graham scheme only because
it acted to counteract other biases.

Under Robinson's unbiased scheme, results improve by counting duplicates
only once during training (just as duplicates are counted only once during
scoring):  the ham score mean decreases significantly and consistently,
as does ham score variance; the spam score mean decreases consistently
(but by less than the ham mean, so the spread increases); and spam
score variance increases.  That implies there's *some* value to be gotten
out of knowing how often a word appears in a msg, but that distorting
spamprob isn't the right way to exploit it.

WordInfo.hamcount now has a different meaning:  it's the number of hams in
which the word appears, instead of the number of times the word appears
across all ham.  Likewise for WordInfo.spamcount.

Note that because both mean scores decreased, you'll probably want a
smaller spam_cutoff value now.  The default spam_cutoff has been changed
from 0.57 to 0.56.  But this is corpus-dependent, so be sure to tune your
value for your corpus.
"""
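
For anyone who wants to see what the new counting rule amounts to, here is
a minimal sketch.  It is not the actual classifier code:  the WordInfo
class below is a stripped-down stand-in that models only the two counts,
and train() and its arguments are hypothetical names made up for the
illustration.

```python
from collections import defaultdict


class WordInfo:
    """Stand-in record; only the two counts from the comment are modeled."""

    def __init__(self):
        self.spamcount = 0  # number of spams in which the word appears
        self.hamcount = 0   # number of hams in which the word appears


def train(wordinfo, tokens, is_spam):
    """Update the counts for one message, counting each word at most once."""
    # set() collapses repeats, so a word seen 3 times in this message
    # still bumps its count by exactly 1.
    for word in set(tokens):
        record = wordinfo[word]
        if is_spam:
            record.spamcount += 1
        else:
            record.hamcount += 1


# Hypothetical toy messages, already tokenized.
wordinfo = defaultdict(WordInfo)
train(wordinfo, ["free", "free", "money", "free"], is_spam=True)
train(wordinfo, ["free", "lunch"], is_spam=False)
```

After these two calls "free" has spamcount 1 and hamcount 1, even though it
appeared three times in the spam.  Under the old rule its spamcount would
have been 3.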