[Spambayes] Re: CRM114 in November breaks 99.9%. :-)
Ken Anderson
kanderson at bbn.com
Tue Dec 3 02:00:40 2002
Yes, this is my concern. I think the approach Robert describes is perfectly fine for adaptively learning how to filter email, though there should probably be some form of forgetting. The system will eventually forget on its own as words occur less often.
However, if this is the approach Bill uses, you can't use it for performance estimates. Our speech and natural language group is very careful not to mix its training set with its test set. When they do, they do something like 10-fold cross validation, which averages (?) the results of 10 experiments that each take some random fraction of the data as training and the rest as testing.
This gives a lower performance score that is likely to be more accurate on real data.
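To make the idea concrete, here is a rough sketch in Python. It uses the standard variant, where the data is split into 10 disjoint folds and each fold is held out once, rather than 10 independent random splits. The names k_fold_scores, train_fn, and classify_fn are made up for illustration and are not part of CRM114 or Spambayes; plug in whatever training and classification hooks your filter has.

```python
import random

def k_fold_scores(messages, labels, train_fn, classify_fn, k=10, seed=0):
    """Return one accuracy score per fold.

    train_fn(train_messages, train_labels) -> model   (hypothetical hook)
    classify_fn(model, message) -> predicted label    (hypothetical hook)
    """
    if len(messages) != len(labels):
        raise ValueError("messages and labels must be the same length")
    if len(messages) < k:
        raise ValueError("need at least k messages")

    # Shuffle once, then deal the indices into k disjoint folds.
    indices = list(range(len(messages)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    scores = []
    for held_out in folds:
        held = set(held_out)
        # Train only on messages outside the held-out fold.
        train_idx = [i for i in indices if i not in held]
        model = train_fn([messages[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # Test only on the held-out fold, which the model has never seen.
        correct = sum(classify_fn(model, messages[i]) == labels[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return scores
```

The number to report is the mean of the returned scores, sum(scores) / len(scores). The point of the structure is that no message is ever scored by a model that was trained on it.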
If you're getting 3 9's, be sure you're getting them the hard way.
k
At 05:35 PM 12/2/2002, Robert Woodhead wrote:
>At 11:04 AM -0500 12/2/02, Ken Anderson wrote:
>>The "train only on errors" bothers me. Can you say what you use for a training set and what you use for a test set?
>
>Yeah, have you considered training on everything? That is to say, have CRM classify an email, assume it is correct, and train on it. Then, if an email comes through as false positive or negative (an error), you tell CRM to untrain on that email only.
>
>R
>
>--
>===========================================================
>Robert Woodhead, CEO, AnimEigo http://www.animeigo.com/
>===========================================================
>http://selfpromotion.com/ The Net's only URL registration
>SHARESERVICE. A power tool for power webmasters.