[spambayes-dev] train to exhaustion?

Thu Feb 12 09:15:04 EST 2004

Tony Meyer wrote:
> By "results stop improving", do you think that the intention is that
> the same number of messages are misclassified, or that the scores
> stop getting better?  (ie. if one message was still a false-positive,
> but moved from 0.8 to 0.7, is that improving?).

Gary's original blog entry defines training to exhaustion as follows:

"Training to exhaustion" is repeating training on error, with the same
message corpus, until no errors remain.

The phrase "until no errors remain" says to me that you *want* to keep
iterating until that false-positive is correctly classified.  I would
think, then, that you would keep going as long as the score indicates
that you are getting closer to correct classification.
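
A rough sketch of how I read that definition, in case it helps the
discussion.  ToyClassifier and train_to_exhaustion are names I made up
for illustration; this is not the SpamBayes API, and the scoring rule is
a crude token count rather than the real probability combining.  The
0.5 cutoff and the max_rounds safety limit are also my assumptions.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (not SpamBayes)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        # Count each distinct token once per trained message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Share of the token evidence that points to spam;
        # 0.5 when none of the tokens have been seen before.
        s = sum(self.spam[t] for t in set(tokens))
        h = sum(self.ham[t] for t in set(tokens))
        return 0.5 if s + h == 0 else s / (s + h)


def train_to_exhaustion(clf, corpus, cutoff=0.5, max_rounds=50):
    """Repeat train-on-error over the same corpus until a full pass
    produces no errors.  Returns the number of passes made, or None
    if max_rounds passes were not enough."""
    for rounds in range(1, max_rounds + 1):
        errors = 0
        for tokens, is_spam in corpus:
            if (clf.score(tokens) > cutoff) != is_spam:
                clf.learn(tokens, is_spam)  # train only on mistakes
                errors += 1
        if errors == 0:
            return rounds
    return None
```

Note that this version only looks at whether each message lands on the
right side of the cutoff, so it stops at "no errors" and never asks
whether the scores themselves are still improving.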

What I'm less clear on is what to do if repeated training on that last
remaining false positive starts causing other messages to be
misclassified.  I wonder what would happen if you computed an
"incorrectness score", the average distance from perfect classification
over all messages, and stopped as soon as that average increased?
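
To make the idea concrete, here is a minimal sketch of that score and
stopping test.  The function names are hypothetical, and I'm assuming
scores run from 0.0 (certainly ham) to 1.0 (certainly spam), so the
distance from perfect is the score itself for ham and 1 - score for
spam.

```python
def mean_incorrectness(scored):
    """Average distance from perfect classification.

    scored is an iterable of (score, is_spam) pairs, with score in
    [0.0, 1.0] and 1.0 meaning "certainly spam" (an assumption).
    """
    scored = list(scored)
    if not scored:
        raise ValueError("need at least one scored message")
    total = sum((1.0 - s) if is_spam else s for s, is_spam in scored)
    return total / len(scored)


def should_stop(history):
    """True once the latest average is worse than the previous one.

    history is the list of mean_incorrectness values, one per
    training pass, oldest first.
    """
    return len(history) >= 2 and history[-1] > history[-2]
```

By this measure, Tony's false positive moving from 0.8 to 0.7 does
count as improving: its contribution drops from 0.8 to 0.7 even though
the message is still misclassified.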

In any case, this is a very computationally intensive process, since
every pass has to rescore the whole corpus.  It seems like it would be
a good approach for initial training over a starting corpus, but maybe
not well suited to ongoing incremental training.

-- 
Kenny Pitt