[spambayes-dev] train to exhaustion?

Thu Feb 12 17:51:37 EST 2004

    Kenny> Tony Meyer wrote:
    >> By "results stop improving", do you think that the intention is that
    >> the same number of messages are misclassified, or that the scores
    >> stop getting better?  (ie. if one message was still a false-positive,
    >> but moved from 0.8 to 0.7, is that improving?).

    Kenny> Gary's original blog entry defines train-to-exhaustion as the
    Kenny> following:

    Kenny> "Training to exhaustion" is repeating training on error, with the
    Kenny> same message corpus, until no errors remain.

    Kenny> The "until no errors remain" says to me that you *want* to keep
    Kenny> iterating until that false-positive is correctly classified.  

That's how I interpreted it as well when I wrote tte.py.  With my current
training database (roughly 700 total messages, evenly split between hams and
spams) it takes five passes through the database (two to three minutes) to
correctly classify all messages.  Each pass is faster than its predecessor
because it trains on fewer messages.

    Kenny> I would think, then, that you would keep going as long as the
    Kenny> score indicates that you are getting closer to correct
    Kenny> classification.

And stop once all hams score at or below the ham_cutoff and all spams score
at or above the spam_cutoff.

    Kenny> Where I'm a bit unclear is what to do if repeated training on
    Kenny> that last remaining false positive starts causing other messages
    Kenny> to be misclassified.

I think you keep at it.  The tte.py script scores each message on each pass,
ignoring the results for that message on previous passes.  If a message
scores outside its zone on this pass (a ham above the ham_cutoff or a spam
below the spam_cutoff), it is trained.  It doesn't matter if it was in the
zone on an earlier pass.
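
Roughly, the loop looks like the sketch below.  This is not the actual
tte.py code.  ToyClassifier, its averaging score, the cutoff values and the
max_passes safety cap are all made up for illustration; a real run would use
the SpamBayes classifier and the configured cutoffs instead.

```python
from collections import Counter

# Illustrative cutoffs only (assumed values, not taken from any real config).
HAM_CUTOFF = 0.3
SPAM_CUTOFF = 0.7


class ToyClassifier:
    """Hypothetical stand-in for a real classifier; not the SpamBayes API."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam trainings
        self.ham = Counter()   # token -> number of ham trainings

    def learn(self, tokens, is_spam):
        # Count each distinct token once per training event.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Average of smoothed per-token spam ratios; unseen tokens give 0.5.
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            probs.append((s + 0.5) / (s + h + 1.0))
        return sum(probs) / len(probs) if probs else 0.5


def train_to_exhaustion(clf, hams, spams, max_passes=100):
    """Train on errors until a full pass trains nothing.

    Returns the number of passes made (the last one being the clean pass),
    or None if max_passes is reached first.
    """
    for pass_no in range(1, max_passes + 1):
        trained = 0
        for msgs, is_spam in ((hams, False), (spams, True)):
            for tokens in msgs:
                # Rescore on every pass; earlier results are ignored.
                score = clf.score(tokens)
                if is_spam:
                    in_zone = score >= SPAM_CUTOFF
                else:
                    in_zone = score <= HAM_CUTOFF
                if not in_zone:
                    clf.learn(tokens, is_spam)
                    trained += 1
        if trained == 0:
            return pass_no
    return None
```

Because every message is rescored on every pass, a message that drifts back
out of its zone after some other message is trained simply gets trained
again on the next pass.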

Here's how I look at that sort of thing.  Some hams and some spams have a
significant number of tokens in common.  By repeatedly training on those
messages we discount the value of the shared tokens and increase the value
of each message's unique tokens.

    Kenny> I wonder what would happen if you took an "incorrectness score"
    Kenny> that was the average of the distance from perfect classification
    Kenny> over all messages, and stop if that average ever increases?

I don't understand what you're suggesting.  What is "perfect classification
over all messages"?

Skip