[Spambayes] Trained two times as much spam as ham
Skip Montanaro
skip at pobox.com
Tue Jan 18 22:49:40 CET 2005
Kenny> Yes and no. The nature of spam makes it highly likely that the
Kenny> imbalance will continue to grow. As developers, we are very
Kenny> concerned about this and are trying to come up with some ideas to
Kenny> improve the situation. Unfortunately, it's a difficult problem
Kenny> to solve in a general way.
What I do in the contrib/tte.py code is:
1. Run through the mailboxes in reverse, on the assumption that newer
messages are more important than older ones.
2. When finishing up, a new mailbox is written. Any messages that were
correctly scored on each train-to-exhaustion pass are not written to
the new file. Any messages that were not considered at all (because
of ham/spam imbalance) are also scored at this point. If they score
correctly, they are not written to the new mailbox.
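Roughly, the pruning in step 2 could be sketched like this. This is not the
actual tte.py code, just an illustration of the idea. The names prune,
scored_correctly, score, always_correct and considered are all made up for
the example, and the 0.2/0.9 cutoffs are assumed values, not something taken
from anyone's configuration.

```python
def scored_correctly(prob, is_spam, ham_cutoff=0.2, spam_cutoff=0.9):
    """True if a spam probability lands on the right side of its cutoff.

    The cutoff values are assumptions made for this sketch.
    """
    return prob >= spam_cutoff if is_spam else prob <= ham_cutoff


def newest_first(messages):
    """Step 1: visit messages in reverse, newest first."""
    return list(reversed(messages))


def prune(messages, is_spam, score, always_correct, considered):
    """Step 2: return the messages worth writing to the new mailbox.

    messages       -- message ids in mailbox order
    score          -- hypothetical callable: message id -> spam probability
    always_correct -- ids scored correctly on every training pass
    considered     -- ids that took part in training at all
    """
    kept = []
    for msg in messages:
        if msg in always_correct:
            continue  # scored correctly on each pass, so drop it
        if msg not in considered and scored_correctly(score(msg), is_spam):
            continue  # skipped because of imbalance, but scores fine now
        kept.append(msg)
    return kept
```

Anything that needed training at some point, or that was skipped and still
scores wrong, stays in the new mailbox; everything else is left out.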
After completion, the user can decide whether or not to overwrite the old
mailbox with the new, often smaller, one. My current "best practice" is to
allow the spam mailbox to shrink as appropriate, but to never shrink the ham
mailbox. I think that may help keep the ham/spam imbalance from getting too
far out of whack.
The other thing I do is periodically trim both the ham and spam datasets to
some reasonable number (50-100 messages or so). That keeps any mistakes I
make (sometimes my fingers are faster than my brain) from getting too
entrenched. The downside is that I have to put up with some extra unsures
for a period of time.
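The trimming itself can be as simple as the sketch below. It assumes that
"trim" means keeping the newest messages, which fits the newer-is-more-
important assumption above but is a guess all the same. The function name and
the default limit are made up for the example.

```python
def trim_to_newest(messages, limit=100):
    """Keep only the last `limit` messages of a mailbox-ordered list.

    Assumes the list is ordered oldest first, so the tail is the newest mail.
    """
    if limit <= 0:
        return []  # guard: messages[-0:] would return the whole list
    return list(messages[-limit:])
```

Run once on the ham list and once on the spam list, this caps both at the
same size, at the cost of the extra unsures until retraining catches up.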
It might be useful to codify some of these ideas into a tool the user can
run to reduce training dataset sizes without necessarily committing to the
train-to-exhaustion concept.
Skip