[Spambayes] Frequency distribution for wordinfo counts?

Skip Montanaro skip at pobox.com
Tue Feb 24 12:16:56 EST 2004


>>>>> "Tony" == Tony Meyer <tameyer at ihug.co.nz> writes:

    Tony> [Training to exhaustion]
    >> Seems to work pretty well.  Here's a run I did just now:
    >> 
    >> % python ~/tmp/spambayes/contrib/tte.py -g 
    >> newham.clean.save -s newspam.clean.save -d tte.db 
    >> round:  1, msgs:  770, ham misses: 196, spam misses: 244, 67.7s
    >> round:  2, msgs:  770, ham misses:  33, spam misses:  55, 49.4s
    >> round:  3, msgs:  770, ham misses:   8, spam misses:   5, 33.1s
    >> round:  4, msgs:  770, ham misses:   0, spam misses:   0, 28.6s
    >> 1 untrained spams

    Tony> How did these 770 messages get selected?  Is this a batch of
    Tony> recently arrived mail, or some sort of pre-selected training
    Tony> collection?  Did tte.db exist before this?

I have two piles of mail that I selected myself, one ham and one spam, as
indicated by the command line above.  The tte.py script just iterates over
them, training on a message from one, then a message from the other.  tte.db
is written from scratch on each run, but not twiddled between rounds of a
single run.
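
In case it helps, the core of what tte.py does is roughly the loop below.
This is only a sketch of the idea, not the actual contrib/tte.py code:
classify() and train() are made-up stand-ins for the real Spambayes
tokenizer/classifier calls, and the miss counters correspond to the columns
in the run output above.

    def train_to_exhaustion(hams, spams, classify, train, max_rounds=10):
        # Loop over the interleaved ham/spam piles, retraining on every
        # message the classifier still gets wrong, until a full round
        # produces zero misses (or max_rounds is hit).
        for round_no in range(1, max_rounds + 1):
            ham_misses = spam_misses = 0
            for msg, is_spam in interleave(hams, spams):
                if classify(msg) != is_spam:      # still misclassified...
                    train(msg, is_spam)           # ...so train on it again
                    if is_spam:
                        spam_misses += 1
                    else:
                        ham_misses += 1
            print("round: %2d, ham misses: %3d, spam misses: %3d"
                  % (round_no, ham_misses, spam_misses))
            if ham_misses == 0 and spam_misses == 0:
                break                             # nothing left to learn

    def interleave(hams, spams):
        # Alternate one ham, one spam, one ham, ... (extras at the end).
        for i in range(max(len(hams), len(spams))):
            if i < len(hams):
                yield hams[i], False
            if i < len(spams):
                yield spams[i], True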

    >> Adding up the last column indicates a total run time of about three
    >> minutes. I can live with that.

    Tony> How often do you tend to run this?

Right now a few times a day.  I've been out for a week, so I have lots of
unsures to train on.  I select a few hams and spams, run tte.py, then put the
database in place.  Every once in a while I reprocess the entire unsure pile
(825 messages at the moment, but it was over 2500 when I got back from
vacation).  I didn't have things adjusted very well before I left.
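
The mechanics are just something like the two commands below (the tte.py
flags are the ones from the run above; ~/hammie.db is a made-up stand-in for
wherever your live database actually lives):

    % python ~/tmp/spambayes/contrib/tte.py -g newham.clean.save -s newspam.clean.save -d tte.db
    % cp tte.db ~/hammie.db    # swap the freshly rebuilt database into place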

    >> The database thus winds up smaller than it would be with a more usual
    >> training approach.

    Tony> Although slightly larger than mistake-based training (541 instead
    Tony> of 440), presumably more accurate as well.

Who knows? ;-)

Skip




