[Spambayes] Frequency distribution for wordinfo counts?
skip at pobox.com
Tue Feb 24 12:16:56 EST 2004
>>>>> "Tony" == Tony Meyer <tameyer at ihug.co.nz> writes:
Tony> [Training to exhaustion]
>> Seems to work pretty well. Here's a run I did just now:
>> % python ~/tmp/spambayes/contrib/tte.py -g
>> newham.clean.save -s newspam.clean.save -d tte.db
>> round: 1, msgs: 770, ham misses: 196, spam misses: 244, 67.7s
>> round: 2, msgs: 770, ham misses: 33, spam misses: 55, 49.4s
>> round: 3, msgs: 770, ham misses: 8, spam misses: 5, 33.1s
>> round: 4, msgs: 770, ham misses: 0, spam misses: 0, 28.6s
>> 1 untrained spams
Tony> How did these 770 messages get selected? Is this a batch of
Tony> recently arrived mail, or some sort of pre-selected training
Tony> collection? Did tte.db exist before this?
I have two piles of mail selected by me, one ham, one spam as indicated by
the command line above. The tte.py script just iterates over them, training
a message from one, then a message from the other. tte.db is written from
scratch on each run, but not twiddled between rounds of a single run.
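For anyone curious what that loop looks like, here's a minimal sketch of the train-to-exhaustion idea. It's not the real tte.py, and the ToyClassifier below is a hypothetical stand-in for the SpamBayes classifier: alternate ham and spam, train on any message the classifier still gets wrong, and repeat rounds until a round produces zero misses.

```python
# Minimal train-to-exhaustion sketch. ToyClassifier is a hypothetical
# stand-in for the real SpamBayes classifier, not its actual API.

class ToyClassifier:
    """Toy token-count scorer: spam iff spam evidence outweighs ham."""
    def __init__(self):
        self.spam_counts = {}
        self.ham_counts = {}

    def train(self, tokens, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1

    def classify(self, tokens):
        spam_score = sum(self.spam_counts.get(t, 0) for t in tokens)
        ham_score = sum(self.ham_counts.get(t, 0) for t in tokens)
        return spam_score > ham_score  # True -> judged spam

def train_to_exhaustion(clf, hams, spams, max_rounds=10):
    """Alternate one ham, one spam; train only on misses; stop when
    a full round produces no misses (or max_rounds is hit)."""
    for round_no in range(1, max_rounds + 1):
        ham_misses = spam_misses = 0
        for ham, spam in zip(hams, spams):
            if clf.classify(ham):           # ham judged spam: a miss
                ham_misses += 1
                clf.train(ham, is_spam=False)
            if not clf.classify(spam):      # spam judged ham: a miss
                spam_misses += 1
                clf.train(spam, is_spam=True)
        print("round: %d, ham misses: %d, spam misses: %d"
              % (round_no, ham_misses, spam_misses))
        if ham_misses == 0 and spam_misses == 0:
            break
    return clf
```

Training only on mistakes is what keeps the resulting database small: messages the classifier already handles contribute nothing.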
>> Adding up the last column indicates a total run time of about three
>> minutes. I can live with that.
Tony> How often do you tend to run this?
Right now a few times a day. I've been out for a week, so I have lots of
unsures to train on. I select a few hams and spams, run tte.py then put the
database in place. Every once in a while I reprocess the entire unsure pile
(825 messages at the moment, but it was over 2500 when I got back from
vacation). I didn't have things adjusted very well before I left.
>> The database thus winds up smaller than it would be with a more usual
>> training approach.
Tony> Although slightly larger than mistake-based-training (541 instead
Tony> of 440), but presumably more accurate as well.
Who knows? ;-)