[Spambayes] Re: CRM114 in November breaks 99.9%. :-)
Brian Burton
brian at burton-computer.com
Tue Dec 3 05:58:04 2002
--On Monday, December 02, 2002 9:30 PM -0500 Bill Yerazunis <wsy@merl.com>
wrote:
> Training a particular incarnation of CRM114 usually takes a week or
> two; I read my mail (both categories) and when I find a piece of mail
> misclassified, I train that one piece into the filter.
Training only on errors after a cut-off point is interesting. Why do you
do this? Is there a reason not to increment the good/spam counts for the
terms in every email? Is it to avoid overflowing the counts in your hash
table, or is it likely to be more accurate because it keeps the message
counts small?
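
To make the two policies concrete, here is a minimal sketch of what I
mean. The ToyFilter class and the run() helper are hypothetical names
made up for illustration; this is not CRM114 or Spambayes code, and the
scoring is a deliberately crude sum of token counts.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical token-count filter, for illustration only."""

    def __init__(self):
        self.counts = {"good": Counter(), "spam": Counter()}
        self.trained = 0  # number of messages trained into the filter

    def classify(self, tokens):
        # Score each class by its summed token counts; ties go to "good".
        spam = sum(self.counts["spam"][t] for t in tokens)
        good = sum(self.counts["good"][t] for t in tokens)
        return "spam" if spam > good else "good"

    def train(self, tokens, label):
        # Increment the per-class count of every token in the message.
        self.counts[label].update(tokens)
        self.trained += 1


def run(messages, train_on_errors_only):
    """Feed (tokens, label) pairs through a fresh filter under one policy."""
    f = ToyFilter()
    for tokens, label in messages:
        wrong = f.classify(tokens) != label
        if wrong or not train_on_errors_only:
            f.train(tokens, label)
    return f


messages = [
    (["cheap", "pills"], "spam"),
    (["meeting", "notes"], "good"),
    (["cheap", "pills"], "spam"),
    (["meeting", "notes"], "good"),
]
errors_only = run(messages, train_on_errors_only=True)
everything = run(messages, train_on_errors_only=False)
```

On this toy input the error-only policy trains on far fewer messages, so
the per-term counts stay small; training on everything increments them
for every email.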
> After a couple of days the errors get very sparse; after two or three
> weeks, I "go for data" and that's what gets reported in the monthlies.
Perhaps I misunderstand, but doesn't that mean that you are training up to
a desirable accuracy before beginning to measure your accuracy? Is the
transition from training to performance measurement based on a
predetermined arbitrary cut-off (e.g. 1,000 emails, x% of the messages in
the corpus, or 14 calendar days of training) or on the accuracy rising to
a certain level?
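
The two kinds of transition rule I am asking about could be sketched
like this. The function name, its parameters, and the sliding window are
all assumptions of mine for illustration, not a description of how the
CRM114 numbers are actually produced.

```python
def should_start_measuring(outcomes, fixed_cutoff=None,
                           target_accuracy=None, window=100):
    """Decide whether to switch from training to measurement.

    outcomes: list of bools, True where the filter classified correctly.
    """
    if fixed_cutoff is not None:
        # Predetermined rule: depends only on how many messages were seen.
        return len(outcomes) >= fixed_cutoff
    if target_accuracy is not None:
        # Accuracy-driven rule: depends on performance over the last
        # `window` messages, and needs a full window before it can fire.
        recent = outcomes[-window:]
        return (len(recent) == window
                and sum(recent) / window >= target_accuracy)
    raise ValueError("choose fixed_cutoff or target_accuracy")
```

With the first rule the measurement period starts regardless of how the
filter is doing; with the second, measurement only begins once the filter
is already performing well, which is what would bias the reported figure
upward.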
All the best,
++Brian