[spambayes-dev] Another incremental training idea...

Wed Jan 14 22:25:30 EST 2004

[Kenny Pitt]
> My description applies to auto-balancing of "train on mistakes and
> unsures" instead of "train on everything" or "train on almost
> everything".  The algorithm could easily be reversed to do TOE where
> there are no configured edge thresholds.  Doing TOAE effectively would
> probably require your adjustment.
>
> For mistake-based training, the idea is that as long as my balance is
> very close to 1:1, I'm happy to train only on the messages that I
> manually reclassify because of mistakes and unsures.  If that
> mistake-based training causes an imbalance then the auto-balancer
> kicks in with an edge threshold very close to the classifier cutoff
> so that only the worst-scoring messages are trained.  As the
> imbalance worsens, the edge threshold is dynamically adjusted as
> needed to train on more messages and try to push the balance back
> towards 1:1.
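
For concreteness, here is one way the quoted scheme could be sketched. This is only an illustration under assumptions, not code from SpamBayes: the function name, the `tolerance` and `max_ratio` knobs, and the linear ramp are all made up for the example, and the 0.20 / 0.90 cutoffs are just assumed values.

```python
# Hypothetical sketch of a dynamic edge threshold; not SpamBayes code.
HAM_CUTOFF = 0.20   # assumed classifier cutoffs, for illustration only
SPAM_CUTOFF = 0.90


def edge_threshold(nham, nspam, tolerance=1.1, max_ratio=3.0):
    """Return (category, threshold) for auto-training, or None if balanced.

    While the trained ham:spam ratio is within `tolerance` of 1:1, nothing
    extra is trained. Past that, the threshold starts at the cutoff for the
    under-represented category and moves toward the far end of the score
    range as the imbalance approaches `max_ratio`.
    """
    ratio = max(nham, nspam) / max(1, min(nham, nspam))
    if ratio <= tolerance:
        return None
    # 0.0 = barely imbalanced, 1.0 = imbalance at or beyond max_ratio
    severity = min(1.0, (ratio - tolerance) / (max_ratio - tolerance))
    if nham < nspam:
        # Ham is short: train ham scoring at or above this threshold.
        return ("ham", HAM_CUTOFF * (1.0 - severity))
    # Spam is short: train spam scoring at or below this threshold.
    return ("spam", SPAM_CUTOFF + (1.0 - SPAM_CUTOFF) * severity)
```

With these made-up defaults, a slight imbalance only picks up ham scoring just under 0.20 (or spam just over 0.90), and a 3:1 imbalance opens the threshold all the way so everything in the short category gets trained.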

FWIW, no matter which training strategy I decided to experiment with in
day-to-day Outlook use, that's the one I always ended up doing:  training on
Mistakes and Unsures, but forcing balance every few days by tossing in the
worst-scoring msgs in the under-represented category.  That's worked great
for me in real life, with unigrams for about a year, and again now with
bigrams but for less than a month.
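
In case it helps, the "force balance" step can be sketched roughly as below. This is a minimal illustration and not the Outlook addin's code: the function name and the `(msg_id, score, is_spam)` tuple layout are hypothetical, and it assumes "worst-scoring" means the highest-scoring ham or the lowest-scoring spam among correctly classified messages that haven't been trained on yet.

```python
# Hypothetical sketch of periodic balancing; not SpamBayes code.
def pick_balancing_messages(untrained, nham, nspam):
    """Pick messages to train on so the ham and spam counts even out.

    untrained: list of (msg_id, score, is_spam) tuples for correctly
               classified messages not yet trained on (assumed layout).
    nham, nspam: number of ham and spam messages trained so far.
    """
    deficit = abs(nham - nspam)
    if deficit == 0:
        return []
    want_spam = nspam < nham  # which category is under-represented
    candidates = [m for m in untrained if m[2] == want_spam]
    # Worst-scoring first: lowest scores for spam, highest scores for ham.
    candidates.sort(key=lambda m: m[1], reverse=not want_spam)
    return candidates[:deficit]
```

For example, with 10 ham and 12 spam trained, this returns the two highest-scoring untrained ham messages, if that many are available.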

Alas(?!), I'm getting a lot less spam than I used to -- since Christmas Eve
of 2003 (when I started saving all my email), I've only gotten 1834 of the
beasties, less than 100 per day.  It used to be well over 200 a day.  Maybe
the photos of my penis I sent out convinced spammers there's no point in
trying to sell snow to an Eskimo <wink>.