[spambayes-dev] Another incremental training idea...

T. Alexander Popiel popiel at wolfskeep.com
Wed Jan 14 13:22:56 EST 2004


In message:  <MHEGIFHMACFNNIMMBACAMEFFHDAA.nobody at spamcop.net>
             "Seth Goodman" <nobody at spamcop.net> writes:
>
>I do have a question on your incremental harness with expiry, since it's
>surprising how much worse it performs as soon as it starts expiring
>messages.  For classification purposes, you obviously use the training set
>from the last 120 days of nonedge messages.  Do you then use those same
>scores for the current day's messages to determine which are the nonedge
>messages?  I ask this because you would get a different set of messages to
>train on, and perhaps compensate better for the particular messages you
>expire, if you first expired the 120-day old messages, then rescored the
>current day's messages to determine the nonedge messages to train on.  Does
>this make any sense?

Well, here's the regime code:


###
### This is a training regime for the incremental.py harness.
### It does perfect training for all messages not already
### properly classified with extreme confidence.
###

class nonedgeexpire:
    def __init__(self):
        self.ham = [[]]
        self.spam = [[]]

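    def group_action(self, which, test):
        # Called at each group (day) boundary.  Index 0 holds the newest
        # day; once 120 days are held, untrain and drop the oldest.
        if len(self.ham) >= 120:
            test.untrain(self.ham[119], self.spam[119])
            self.ham = self.ham[:119]
            self.spam = self.spam[:119]
        # Start a fresh list for the new day at index 0, which is
        # where guess_action appends.
        self.ham.insert(0, [])
        self.spam.insert(0, [])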

    def guess_action(self, which, test, guess, actual, msg):
        # guess is (classification, score); actual < 0 means spam.
        # Train on mistakes and on anything not scored at the extremes.
        if guess[0] != actual or 0.005 < guess[1] < 0.995:
            if actual < 0:
                self.spam[0].append(msg)
            else:
                self.ham[0].append(msg)
            return actual
        return 0


This code trains immediately on the non-edge messages, and expires
the oldest day's training at the end of each day.  It does not expire
first and then choose the day's training messages, as you suggest.
Your suggestion is interesting, though it would be a bit expensive
(doubling the number of classifications done).
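
To make the cost concrete, here is a rough sketch of the two orderings
side by side.  It is not the incremental.py harness: ToyClassifier,
run_current, run_expire_first, and the 1-day window are all made up
for illustration, and the toy "classifier" just remembers exact
messages.  It assumes labels of 1 for ham and -1 for spam, and reuses
the 0.005/0.995 cutoffs from the regime above.

```python
class ToyClassifier:
    """Hypothetical stand-in, NOT the spambayes classifier."""

    def __init__(self):
        self.counts = {}           # msg -> (ham trainings - spam trainings)
        self.classifications = 0   # number of classify() calls so far

    def classify(self, msg):
        """Return (guess, spam score) for msg."""
        self.classifications += 1
        n = self.counts.get(msg, 0)
        if n > 0:
            return 1, 0.001        # confident ham
        if n < 0:
            return -1, 0.999       # confident spam
        return 0, 0.5              # unsure

    def train(self, msg, actual):
        self.counts[msg] = self.counts.get(msg, 0) + actual

    def untrain(self, msg, actual):
        self.counts[msg] = self.counts.get(msg, 0) - actual


def is_nonedge(guess, score, actual):
    # Train on mistakes and on anything not scored at the extremes.
    return guess != actual or 0.005 < score < 0.995


def train_nonedge(clf, today):
    """Classify today's messages, training on the non-edge ones."""
    chosen = []
    for msg, actual in today:
        guess, score = clf.classify(msg)
        if is_nonedge(guess, score, actual):
            clf.train(msg, actual)
            chosen.append((msg, actual))
    return chosen


def expire_oldest(clf, window):
    for msg, actual in window.pop():
        clf.untrain(msg, actual)


def run_current(days, max_days):
    """Train as messages arrive, expire at the end of the day."""
    clf, window = ToyClassifier(), []
    for today in days:
        window.insert(0, train_nonedge(clf, today))   # newest day first
        if len(window) > max_days:
            expire_oldest(clf, window)
    return clf


def run_expire_first(days, max_days):
    """Score, expire, then rescore to pick the training messages."""
    clf, window = ToyClassifier(), []
    for today in days:
        for msg, _ in today:
            clf.classify(msg)                 # pass 1: the reported scores
        if len(window) >= max_days:
            expire_oldest(clf, window)        # make room for today
        window.insert(0, train_nonedge(clf, today))   # pass 2: rescore
    return clf


DAYS = [[("a", 1), ("b", -1)], [("a", 1), ("c", -1)]]
current = run_current(DAYS, max_days=1)
expire_first = run_expire_first(DAYS, max_days=1)
print(current.classifications, expire_first.classifications)   # 4 8
```

In this toy run the expire-first ordering also retrains "a" on the
second day, because by the time it is rescored the first day's copy
has already been expired; the current ordering sees "a" as an edge
message, skips it, and then loses it at expiry.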

- Alex


