[spambayes-dev] Another incremental training idea...

Tim Peters tim.one at comcast.net
Wed Jan 14 22:05:17 EST 2004


[Skip Montanaro]
> ...
> It does seem a bit arbitrary, but the system seems to suggest
> we need to be slaves to balance and that's one way to get it.

Cross validation testing is measuring random-time-order TOE performance, and
we know imbalance hurts that.  We also have overwhelming anecdotal evidence
that extreme imbalance hurts users of the Outlook addin, and seemingly no
matter how they train (but understanding that the Outlook UI makes it
difficult to do any kind of training other than "train on everything in
such-and-such set of folders, plus mistakes and unsures" -- so we end up with
OL users training on 20,000 ham from the last 5 years, plus the 10 spam they
got yesterday).

I don't think we've seen enough to draw a conclusion about non-insane
imbalance in other ways of training.  Alex has presented the most evidence
about longer-term effects of non-TOE, time-respecting training, and he seems
to do OK under those even though his imbalance gets worse over time (and
is certainly more extreme than I can tolerate in a variety of real-life ad
hoc training regimes).  OTOH, that's only one corpus, and Alex is weird
<wink>.



