[Spambayes] Frequency distribution for wordinfo counts?
tameyer at ihug.co.nz
Mon Feb 23 21:28:16 EST 2004
[Training to exhaustion]
> Seems to work pretty well. Here's a run I did just now:
> % python ~/tmp/spambayes/contrib/tte.py -g newham.clean.save -s newspam.clean.save -d tte.db
> round: 1, msgs: 770, ham misses: 196, spam misses: 244, 67.7s
> round: 2, msgs: 770, ham misses: 33, spam misses: 55, 49.4s
> round: 3, msgs: 770, ham misses: 8, spam misses: 5, 33.1s
> round: 4, msgs: 770, ham misses: 0, spam misses: 0, 28.6s
> 1 untrained spams
How did these 770 messages get selected? Is this a batch of recently
arrived mail, or some sort of pre-selected training collection? Did tte.db
exist before this?
> Adding up the last column indicates a total run time of about
> three minutes. I can live with that.
How often do you tend to run this?
> The database thus winds up smaller than it would be with a
> more usual training approach.
It is slightly larger than with mistake-based training, though (541 instead
of 440), but presumably more accurate as well.
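
For anyone following along who hasn't looked at the script, the round-by-round output quoted above can be read as a loop of roughly the following shape. This is only a rough sketch of the general train-to-exhaustion idea, not the code in contrib/tte.py: ToyClassifier, train_to_exhaustion, the cutoff values, and the max_rounds limit are all made up for illustration, and the scoring is a crude stand-in rather than the SpamBayes classifier.

```python
import math
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in classifier; NOT the SpamBayes API."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spams trained with it
        self.ham = Counter()   # token -> number of hams trained with it

    def learn(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Sum of smoothed per-token log-odds, squashed into (0, 1).
        # An untrained classifier scores everything at 0.5.
        logodds = sum(
            math.log((self.spam[t] + 1) / (self.ham[t] + 1))
            for t in set(tokens)
        )
        return 1.0 / (1.0 + math.exp(-logodds))


def train_to_exhaustion(hams, spams, clf,
                        ham_cutoff=0.2, spam_cutoff=0.9, max_rounds=10):
    """Train only on misses, repeating until a round has none.

    The cutoffs and max_rounds are assumed values for this sketch.
    Returns a list of (ham_misses, spam_misses), one entry per round.
    """
    history = []
    for _ in range(max_rounds):
        ham_misses = spam_misses = 0
        for tokens in hams:
            if clf.score(tokens) > ham_cutoff:   # ham not scored as ham
                clf.learn(tokens, is_spam=False)
                ham_misses += 1
        for tokens in spams:
            if clf.score(tokens) < spam_cutoff:  # spam not scored as spam
                clf.learn(tokens, is_spam=True)
                spam_misses += 1
        history.append((ham_misses, spam_misses))
        if ham_misses == 0 and spam_misses == 0:
            break
    return history


if __name__ == "__main__":
    hams = [["meeting", "tomorrow"], ["lunch", "meeting"]]
    spams = [["cheap", "pills"], ["cheap", "watches"]]
    rounds = train_to_exhaustion(hams, spams, ToyClassifier())
    for n, (h, s) in enumerate(rounds, 1):
        print("round: %d, ham misses: %d, spam misses: %d" % (n, h, s))
```

The miss counts shrink each round because only the messages the classifier still gets wrong are trained on, which would also explain why the resulting database stays small.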