[spambayes-dev] Another incremental training idea...
Tim Peters
tim.one at comcast.net
Wed Jan 14 22:25:30 EST 2004
[Kenny Pitt]
> My description applies to auto-balancing of "train on mistakes and
> unsures" instead of "train on everything" or "train on almost
> everything". The algorithm could easily be reversed to do TOE where
> there are no configured edge thresholds. Doing TOAE effectively would
> probably require your adjustment.
>
> For mistake-based training, the idea is that as long as my balance is
> very close to 1:1, I'm happy to train only on the messages that I
> manually reclassify because of mistakes and unsures. If that
> mistake-based training causes an imbalance then the auto-balancer
> kicks in with an edge threshold very close to the classifier cutoff
> so that only the worst-scoring messages are trained. As the
> imbalance worsens, the edge threshold is dynamically adjusted as
> needed to train on more messages and try to push the balance back
> towards 1:1.
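
A minimal sketch of the auto-balancing idea quoted above, not code from SpamBayes itself. The function names, the tolerance and max_ratio parameters, and the linear widening rule are all assumptions made for illustration; the quoted text only says the edge threshold starts near the classifier cutoff and moves as the imbalance worsens.

```python
def edge_threshold(n_under, n_over, cutoff, extreme, tolerance=1.1, max_ratio=4.0):
    """Return the score edge for the under-represented category, or None.

    n_under / n_over: trained counts of the smaller and larger category.
    cutoff:  classifier cutoff for the smaller category (hypothetical
             values: 0.20 for ham, 0.90 for spam).
    extreme: the "perfect" score for that category (0.0 ham, 1.0 spam).
    tolerance, max_ratio: assumed tuning knobs, not SpamBayes options.
    """
    ratio = n_over / max(n_under, 1)
    if ratio <= tolerance:
        # Close enough to 1:1: train on mistakes and unsures only.
        return None
    # Slide the edge linearly from the cutoff toward the extreme as the
    # imbalance grows; at max_ratio every message in the category qualifies.
    frac = min((ratio - tolerance) / (max_ratio - tolerance), 1.0)
    return cutoff + frac * (extreme - cutoff)


def should_autotrain(score, edge, cutoff):
    """True if a correctly classified message falls between cutoff and edge."""
    if edge is None:
        return False
    return min(edge, cutoff) <= score <= max(edge, cutoff)
```

With a mild imbalance the edge sits just inside the cutoff, so only the worst-scoring messages of the under-represented category are trained; a larger imbalance moves the edge outward and admits more of them.
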
FWIW, no matter which training strategy I decided to experiment with in
day-to-day Outlook use, that's the one I always ended up doing: training on
Mistakes and Unsures, but forcing balance every few days by tossing in the
worst-scoring msgs in the under-represented category. That's worked great
for me in real life, with unigrams for about a year, and again now with
bigrams but for less than a month.
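
The periodic "force balance" step could be sketched roughly as follows. This is an illustration under stated assumptions, not what the Outlook addin does: the function name, the (score, msg_id) tuples, and the choice to make up the whole deficit in one pass are all hypothetical.

```python
import heapq


def pick_balancing_messages(n_ham, n_spam, untrained_ham, untrained_spam):
    """Pick the worst-scoring untrained messages in the smaller category.

    n_ham, n_spam: number of messages already trained in each category.
    untrained_*:   lists of (score, msg_id) for correctly classified
                   messages that have not been trained on yet.
    Returns (category, picked), or (None, []) when already balanced.
    """
    deficit = abs(n_ham - n_spam)
    if deficit == 0:
        return None, []
    if n_ham < n_spam:
        # Worst-scoring ham: the highest spam scores among ham.
        return "ham", heapq.nlargest(deficit, untrained_ham)
    # Worst-scoring spam: the lowest spam scores among spam.
    return "spam", heapq.nsmallest(deficit, untrained_spam)
```

For example, with 3 ham and 5 spam trained, the two ham messages that scored closest to the cutoff would be selected for training.
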
Alas(?!), I'm getting a lot less spam than I used to -- since Christmas Eve
of 2003 (when I started saving all my email), I've only gotten 1834 of the
beasties, less than 100 per day. It used to be well over 200 a day. Maybe
the photos of my penis I sent out convinced spammers there's no point in
trying to sell snow to an Eskimo <wink>.