[spambayes-dev] Another incremental training idea...

Wed Jan 14 20:26:47 EST 2004

> I've also been kicking around some auto-training ideas hoping for time 
> to try them.  One idea I had was based on a "sliding non-edge" scale. 
> You would set a max imbalance, say 2:1, beyond which you would train 
> on everything on the low side.
> As your imbalance falls back below the maximum, auto-train 
> would start skipping the "edge" messages with near perfect 
> classification scores.  The closer you get to a perfect 1:1 
> balance, the closer to the cutoff score the message would 
> need to be before it would get auto-trained.  Anyone see any 
> obvious holes in this idea?

I tried almost exactly this with the incremental regime, using a maximum
imbalance of 2::1 or 1::2.  It did pretty consistently worse than the basic
nonedge regime.  The only difference is that I didn't choose which messages
to use (by score) when an imbalance would be created.  The idea was
basically to do nonedge, except that when there is an imbalance, only
messages that move the balance closer to 1::1 are trained.
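
As a rough sketch of that rule (not the actual test harness code; the
function names, the 0.005 edge width and the handling of an empty database
are all assumptions made for illustration), the decision for each message
looks something like this:

```python
def is_nonedge(score, is_spam, edge=0.005):
    """Nonedge-style test: train unless the score is already near perfect.

    `edge` is an assumed width for "near perfect"; spam wants a score
    close to 1.0, ham a score close to 0.0.
    """
    if is_spam:
        return score < 1.0 - edge
    return score > edge


def would_unbalance(is_spam, nspam, nham, max_ratio=2.0):
    """True if training this message leaves its class above max_ratio.

    nspam/nham are the counts trained so far.  max(other, 1) is an
    assumption so that an empty database can still be seeded.
    """
    same = (nspam if is_spam else nham) + 1
    other = nham if is_spam else nspam
    return same > max_ratio * max(other, 1)


def should_train(score, is_spam, nspam, nham, max_ratio=2.0):
    """Nonedge, but skip anything that would break the balance limit."""
    if not is_nonedge(score, is_spam):
        return False
    return not would_unbalance(is_spam, nspam, nham, max_ratio)
```

Note that messages are taken in arrival order here; nothing looks at the
score when deciding which message on the heavy side to drop.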

The balanced TOE you described (in a later message) is also similar to a
test I did (I called it 'balanced_perfect').  Again, the difference is in
the selection of which messages to use when there is an imbalance (I use the
first ones that come along, whereas you choose based on the score).

Basically, every regime in which I tried this method of keeping the
database balanced did worse than just letting it go as normal.  As well as
the 2::1/1::2 limit, I tried the perfect regime with 3::1 and 2::3, and that
was better, but still not as good as just the regular regime.

If I have time over the weekend, I'll try to come up with a different
self-balancing regime and test that (maybe along these lines, where the
messages to ignore are chosen based on score).

=Tony Meyer