[spambayes-dev] Another incremental training idea...

Kenny Pitt kennypitt at hotmail.com
Wed Jan 14 09:36:02 EST 2004


Seth Goodman wrote:
> [Kenny Pitt]
>> I've also been kicking around some auto-training ideas hoping for
>> time to try them.  One idea I had was based on a "sliding non-edge"
> 
> Another related idea is to dynamically move the edge thresholds until
> the training ratio averages 1:1.

My description applies to auto-balancing of "train on mistakes and
unsures" rather than "train on everything" (TOE) or "train on almost
everything" (TOAE).  The algorithm could easily be reversed to do TOE,
where there are no configured edge thresholds.  Doing TOAE effectively
would probably require your adjustment.

For mistake-based training, the idea is that as long as my balance is
very close to 1:1, I'm happy to train only on the messages that I
manually reclassify because of mistakes and unsures.  If that
mistake-based training causes an imbalance then the auto-balancer kicks
in with an edge threshold very close to the classifier cutoff so that
only the worst-scoring messages are trained.  As the imbalance worsens,
the edge threshold is dynamically adjusted as needed to train on more
messages and try to push the balance back towards 1:1.
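
To make that concrete, here is a rough Python sketch of the mistake-based
case.  Nothing here is existing SpamBayes code: the function names, the
tolerance and max_ratio knobs, the linear mapping from imbalance to
threshold, and the 0.20/0.90 cutoff defaults are all assumptions chosen
for illustration.

```python
def mistake_mode_edge(n_ham, n_spam, ham_cutoff=0.20, spam_cutoff=0.90,
                      tolerance=1.1, max_ratio=4.0):
    """Pick the class to auto-train and its edge threshold, or None if balanced.

    Hypothetical helper: every parameter name and default is an assumption.
    """
    larger, smaller = max(n_ham, n_spam), min(n_ham, n_spam)
    ratio = larger / max(smaller, 1)  # avoid dividing by zero on a new database
    if ratio <= tolerance:
        return None  # close enough to 1:1; train on mistakes and unsures only
    # 0.0 just past the tolerance, 1.0 once the ratio reaches max_ratio.
    severity = min((ratio - tolerance) / (max_ratio - tolerance), 1.0)
    if n_ham < n_spam:
        # Ham is short: train ham scoring in [edge, ham_cutoff).
        # The edge starts at the cutoff and slides toward 0.0.
        return ("ham", ham_cutoff * (1.0 - severity))
    # Spam is short: train spam scoring in [spam_cutoff, edge].
    # The edge starts at the cutoff and slides toward 1.0.
    return ("spam", spam_cutoff + (1.0 - spam_cutoff) * severity)


def should_auto_train(score, is_spam, edge):
    """True if a correctly classified message falls inside the training band."""
    if edge is None:
        return False
    cls, threshold = edge
    if cls == "ham":
        return (not is_spam) and score >= threshold
    return is_spam and score <= threshold
```

With 100 ham and 200 spam trained, this would auto-train only ham that
scored fairly close to the ham cutoff; at 100 ham and 400 spam the band
widens to cover all correctly classified ham.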

For TOE it would be the exact opposite.  I train on all ham and spam as
long as the balance remains at 1:1.  If I start to get an imbalance,
then the edge threshold of the high side (the class with more trained
messages) is adjusted so that the best-scoring messages of that class
are no longer trained.  As the imbalance gets worse, the edge threshold
is adjusted so that fewer and fewer messages are trained.
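
A minimal sketch of the TOE variant, under the same caveats: the names,
the tolerance and max_ratio knobs, the linear mapping, and the 0.20/0.90
cutoff defaults are assumptions for illustration, not existing SpamBayes
code.

```python
def toe_edge(n_ham, n_spam, ham_cutoff=0.20, spam_cutoff=0.90,
             tolerance=1.1, max_ratio=4.0):
    """Return (class, edge) for the over-represented class, or None if balanced.

    Hypothetical helper: every parameter name and default is an assumption.
    """
    larger, smaller = max(n_ham, n_spam), min(n_ham, n_spam)
    ratio = larger / max(smaller, 1)  # avoid dividing by zero on a new database
    if ratio <= tolerance:
        return None  # balanced: keep training on everything
    # 0.0 just past the tolerance, 1.0 once the ratio reaches max_ratio.
    severity = min((ratio - tolerance) / (max_ratio - tolerance), 1.0)
    if n_spam > n_ham:
        # Too much spam: skip spam scoring above the edge.
        # The edge starts at 1.0 and slides down toward the spam cutoff.
        return ("spam", 1.0 - (1.0 - spam_cutoff) * severity)
    # Too much ham: skip ham scoring below the edge.
    # The edge starts at 0.0 and slides up toward the ham cutoff.
    return ("ham", ham_cutoff * severity)


def toe_should_train(score, is_spam, edge):
    """True unless the message is a best-scoring member of the high side."""
    if edge is None:
        return True
    cls, threshold = edge
    if cls == "spam" and is_spam:
        return score <= threshold
    if cls == "ham" and not is_spam:
        return score >= threshold
    return True  # the under-represented class is always trained
```

The under-represented class is never throttled here; only the high side
loses its most confidently scored messages.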

TOAE could be accomplished the same way as TOE simply by obeying the
configured static edge thresholds as limits for the auto-adjusted
thresholds, but this doesn't account for the case where the configured
thresholds discard too many messages for proper balancing.  This is
where you might want to dynamically adjust the configured thresholds, at
least until you get back in balance.
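
The "obey the configured thresholds as limits" part could be sketched as
a simple clamp.  This assumes the auto-adjusted edge arrives as a
(class, threshold) pair or None when balanced, which is my own
convention, and the 0.05/0.95 static edges are made-up example values.

```python
def toae_limits(auto_edge, ham_edge=0.05, spam_edge=0.95):
    """Return (ham_limit, spam_limit) after applying the static edges.

    Train ham scoring >= ham_limit and spam scoring <= spam_limit.
    Hypothetical helper: names and default edge values are assumptions.
    """
    ham_limit, spam_limit = ham_edge, spam_edge
    if auto_edge is not None:
        cls, threshold = auto_edge
        if cls == "ham":
            # The auto edge may only tighten the ham side, never loosen it.
            ham_limit = max(ham_edge, threshold)
        else:
            # Likewise the spam side is capped at the configured edge.
            spam_limit = min(spam_edge, threshold)
    return ham_limit, spam_limit
```

Note that the clamp can only make training stricter than the configured
edges, so it does nothing for the case where those edges already discard
too many messages.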

-- 
Kenny Pitt



