[Spambayes] Trained two times as much spam as ham

Tue Jan 18 22:49:40 CET 2005

    Kenny> Yes and no.  The nature of spam makes it highly likely that the
    Kenny> imbalance will continue to grow.  As developers, we are very
    Kenny> concerned about this and are trying to come up with some ideas to
    Kenny> improve the situation.  Unfortunately, it's a difficult problem
    Kenny> to solve in a general way.

What I do in the contrib/tte.py code is:

    1. Run through the mailboxes in reverse - the assumption is that newer
       messages are more important than older ones

    2. When finishing up, a new mailbox is written.  Any messages that were
       correctly scored on each train-to-exhaustion pass are not written to
       the new file.  Any messages that were not considered at all (because
       of ham/spam imbalance) are also scored at this point.  If they score
       correctly, they are not written to the new mailbox.
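
As a rough illustration of the two steps above, here is a minimal sketch.
It is not the actual contrib/tte.py code: ToyClassifier, is_correct, the
cutoff values and max_rounds are all hypothetical stand-ins, and the message
lists are assumed to be ordered oldest first.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier; not the SpamBayes API."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()

    def train(self, text, is_spam):
        (self.spam if is_spam else self.ham).update(set(text.split()))

    def score(self, text):
        """Return a value in (0, 1); higher means more spam-like."""
        s = h = 1  # add-one smoothing: an untrained classifier scores 0.5
        for tok in set(text.split()):
            s += self.spam[tok]
            h += self.ham[tok]
        return s / (s + h)


def is_correct(clf, text, is_spam, ham_cutoff=0.2, spam_cutoff=0.8):
    """Illustrative cutoffs: anything in between counts as a miss."""
    score = clf.score(text)
    return score >= spam_cutoff if is_spam else score <= ham_cutoff


def train_to_exhaustion(clf, ham, spam, max_rounds=10):
    """Train only on mistakes, newest first; return the messages to keep."""
    n = min(len(ham), len(spam))  # balance: equal numbers of ham and spam
    missed_ham, missed_spam = set(), set()
    corpora = ((ham, False, missed_ham), (spam, True, missed_spam))
    for _ in range(max_rounds):
        mistakes = 0
        for offset in range(1, n + 1):  # offset 1 is the newest message
            for msgs, is_spam, missed in corpora:
                i = len(msgs) - offset
                if not is_correct(clf, msgs[i], is_spam):
                    clf.train(msgs[i], is_spam)
                    missed.add(i)
                    mistakes += 1
        if mistakes == 0:  # a clean pass: nothing left to learn
            break
    # Finishing up: score the older messages skipped because of imbalance.
    for msgs, is_spam, missed in corpora:
        for i in range(len(msgs) - n):
            if not is_correct(clf, msgs[i], is_spam):
                missed.add(i)
    # Only messages that were scored wrongly at least once are kept.
    new_ham = [m for i, m in enumerate(ham) if i in missed_ham]
    new_spam = [m for i, m in enumerate(spam) if i in missed_spam]
    return new_ham, new_spam
```

The returned lists play the role of the new mailboxes: anything the
classifier got right on every pass is left out.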

After completion, the user can decide whether or not to overwrite the old
mailbox with the new, often smaller, one.  My current "best practice" is to
allow the spam mailbox to shrink as appropriate, but to never shrink the ham
mailbox.  I think that may help keep the ham/spam imbalance from getting too
far out of whack.

The other thing I do is periodically trim both the ham and spam datasets to
some reasonable number (50-100 messages or so).  That keeps any mistakes I
make (sometimes my fingers are faster than my brain) from getting too
entrenched.  The downside is that I have to put up with some extra unsures
for a period of time.
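
A trimming step like this could be sketched with the standard library's
mailbox module.  The function name trim_mbox and the keep parameter are
hypothetical, and the sketch assumes the training data lives in mbox files
whose messages are stored oldest first, so that "trim" means "keep the
newest messages".

```python
import mailbox


def trim_mbox(path, keep=100):
    """Delete all but the newest `keep` messages; return how many were removed.

    Assumes messages appear in the file in chronological order.
    """
    box = mailbox.mbox(path)
    box.lock()
    try:
        keys = box.keys()  # mbox keys follow the order of the file
        doomed = keys[:max(0, len(keys) - keep)]
        for key in doomed:
            box.remove(key)
        box.flush()  # rewrite the file without the removed messages
    finally:
        box.close()  # close() also releases the lock
    return len(doomed)
```

Since this rewrites the file in place, it is safer to run it on a copy of
the mailbox first.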

It might be useful to codify some of these ideas into a tool the user can
run to reduce training dataset sizes without necessarily committing to the
train-to-exhaustion concept.

Skip