[Spambayes] Frequency distribution for wordinfo counts?
skip at pobox.com
Mon Feb 23 20:04:40 EST 2004
>> I'm coming into this late, but thought I'd post my numbers. As far
>> as I know, I'm the only person using train-to-exhaustion at the
>> moment. That probably skews my numbers, so maybe they'll be of
Tony> How well is this working for you? Is it really slow? Do you have
Tony> it set to only use a subset of mail, or is it
Tony> training-to-exhaustion on the whole lot?
Seems to work pretty well. Here's a run I did just now:
% python ~/tmp/spambayes/contrib/tte.py -g newham.clean.save -s newspam.clean.save -d tte.db
round: 1, msgs: 770, ham misses: 196, spam misses: 244, 67.7s
round: 2, msgs: 770, ham misses: 33, spam misses: 55, 49.4s
round: 3, msgs: 770, ham misses: 8, spam misses: 5, 33.1s
round: 4, msgs: 770, ham misses: 0, spam misses: 0, 28.6s
1 untrained spams
Adding up the last column gives a total run time of 178.8 seconds, about three minutes.
I can live with that.
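
For anyone who wants to check the arithmetic, here is a small sketch that sums the trailing timing field from the four "round:" lines of the run above. The function name total_seconds is just an illustrative helper of mine, not part of tte.py.

```python
import re

# Output lines copied verbatim from the run shown above.
log = """\
round: 1, msgs: 770, ham misses: 196, spam misses: 244, 67.7s
round: 2, msgs: 770, ham misses: 33, spam misses: 55, 49.4s
round: 3, msgs: 770, ham misses: 8, spam misses: 5, 33.1s
round: 4, msgs: 770, ham misses: 0, spam misses: 0, 28.6s
"""


def total_seconds(text):
    """Sum the trailing "<seconds>s" field of every "round:" line."""
    times = re.findall(r"^round:.*, ([\d.]+)s$", text, flags=re.MULTILINE)
    # Round to one decimal to hide floating-point noise in the sum.
    return round(sum(float(t) for t in times), 1)


print(total_seconds(log))  # seconds; divide by 60 for minutes
```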
Note that even though I fed it 770 messages, only 541 messages (some of them
were duplicates) actually contributed to the final database:
% spamcounts -d tte.db 'saved state'
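
For readers unfamiliar with the idea, the loop that the round-by-round output suggests can be sketched roughly as follows: score every message, train only on the ones that land on the wrong side of a cutoff, and repeat until a round has no misses. This is a toy illustration, not the code in contrib/tte.py. ToyClassifier, train_to_exhaustion, the cutoff values and the sample messages are all made up for the example, and a real run would use the SpamBayes classifier instead.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (not the SpamBayes API)."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()

    def learn(self, words, is_spam):
        (self.spam if is_spam else self.ham).update(set(words))

    def score(self, words):
        # Share of token evidence pointing at spam; 0.5 means "unsure".
        s = sum(self.spam[w] for w in set(words))
        h = sum(self.ham[w] for w in set(words))
        return 0.5 if s + h == 0 else s / (s + h)


def train_to_exhaustion(clf, ham, spam, ham_cutoff=0.2, spam_cutoff=0.9,
                        max_rounds=10):
    """Train only on misclassified messages until a round has no misses.

    Returns (rounds_run, number_of_training_events). The cutoffs are
    assumed values chosen for this sketch.
    """
    labeled = [(m, False) for m in ham] + [(m, True) for m in spam]
    trained = 0
    for round_no in range(1, max_rounds + 1):
        ham_misses = spam_misses = 0
        for msg, is_spam in labeled:
            words = msg.split()
            score = clf.score(words)
            missed = (score < spam_cutoff) if is_spam else (score > ham_cutoff)
            if missed:
                clf.learn(words, is_spam)
                trained += 1
                if is_spam:
                    spam_misses += 1
                else:
                    ham_misses += 1
        print(f"round: {round_no}, msgs: {len(labeled)}, "
              f"ham misses: {ham_misses}, spam misses: {spam_misses}")
        if ham_misses == 0 and spam_misses == 0:
            break
    return round_no, trained


# Made-up sample data; the last spam is a deliberate duplicate.
ham = ["lunch meeting tomorrow", "project status report"]
spam = ["cheap pills online", "win cash now", "cheap pills online"]
rounds, trained = train_to_exhaustion(ToyClassifier(), ham, spam)
```

In this toy run the duplicate spam already scores as spam by the time it is reached, so it never gets trained. That is one plausible way fewer messages end up contributing to the database than were fed in, though I have not checked it against what tte.py actually does.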
The database thus winds up smaller than it would be with a more usual