[Spambayes] Better optimization loop
Rob Hooft
rob@hooft.net
Sun Nov 17 08:28:26 2002
Further changes in the optimization (not yet checked in, but I assume
everyone is running trigrams now...)
I decided that we already have a perfect way to optimize the ham and spam
cutoff values in timcv, so I can remove these from the simplex
optimization. To that end I added a "delayed" flexcost to the
CostCounter module that can use the optimal cutoffs calculated at the
end of timcv.py. That leaves only three variables to optimize
using simplex.
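A rough sketch of what "delayed" means here, for anyone who has not looked at the code: the counter only stores scores while the test runs and prices them once the cutoffs are known. This is not the actual CostCounter code. The class name, method names, the linear ramp between the cutoffs, and the fn weight of 1.0 are all assumptions for illustration.

```python
class DelayedFlexCost:
    """Hypothetical sketch: collect scores first, price them after cutoffs are chosen."""

    def __init__(self, fp_weight=10.0, fn_weight=1.0):
        # fp_weight of $10 is mentioned in the post; fn_weight is an assumption.
        self.fp_weight = fp_weight
        self.fn_weight = fn_weight
        self.ham_scores = []
        self.spam_scores = []

    def ham(self, score):
        # Nothing is priced yet; just remember the score.
        self.ham_scores.append(score)

    def spam(self, score):
        self.spam_scores.append(score)

    @staticmethod
    def _ramp(score, ham_cutoff, spam_cutoff):
        # 0 below the ham cutoff, 1 above the spam cutoff, linear in between (assumed shape).
        if score <= ham_cutoff:
            return 0.0
        if score >= spam_cutoff:
            return 1.0
        return (score - ham_cutoff) / (spam_cutoff - ham_cutoff)

    def total(self, ham_cutoff, spam_cutoff):
        # Called at the end of the run, once the best cutoffs are available.
        cost = 0.0
        for s in self.ham_scores:
            cost += self.fp_weight * self._ramp(s, ham_cutoff, spam_cutoff)
        for s in self.spam_scores:
            cost += self.fn_weight * (1.0 - self._ramp(s, ham_cutoff, spam_cutoff))
        return cost


counter = DelayedFlexCost()
for score in (0.1, 0.5):
    counter.ham(score)
for score in (0.9, 0.5):
    counter.spam(score)
example_total = counter.total(ham_cutoff=0.2, spam_cutoff=0.8)
```

Because the cutoffs are passed in only at the end, the simplex search never has to treat them as free variables.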
I then ran one optimization on my complete (16000+5800) corpus. The
result is that it is fighting very hard to remove fp's while introducing
lots of unsure messages:
At the start:
-> <stat> all runs false positives: 15
-> <stat> all runs false negatives: 7
-> <stat> all runs unsure: 189
Standard Cost: $194.80
Flex Cost: $607.41
Delayed-Standard Cost: $98.80
Delayed-Flex Cost: $310.05
x=0.4990 p=0.1002 s=0.4537 310.05
And near the end:
-> <stat> all runs false positives: 5
-> <stat> all runs false negatives: 6
-> <stat> all runs unsure: 342
-> <stat> all runs false positive %: 0.03125
-> <stat> all runs false negative %: 0.103448275862
-> <stat> all runs unsure %: 1.56880733945
-> <stat> all runs cost: $124.40
Standard Cost: $124.40
Flex Cost: $589.16
Delayed-Standard Cost: $98.60
Delayed-Flex Cost: $212.28
x=0.3515 p=0.2861 s=0.2467 212.28
At this stage it actually managed to get the delayed standard cost lower
by $0.20 (it has been higher than the starting value during much of the
optimization). The Delayed-Flex cost is lowered by about 30%. But look
at the hugely different parameters it had to use! Can someone else run
with these parameters and confirm that this is an extreme that is only
warranted by my particular corpora?
Please note that a delayed flex cost that is this much lower
actually means that in the unsure area there is "50% more order" than
before the optimization!
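For anyone who wants to check the arithmetic, here is a small sketch that recomputes the standard cost from the counts above. The $10 per fp comes from this post; the $1 per fn and $0.20 per unsure message are assumptions that happen to reproduce the reported totals.

```python
FP_COST = 10.00      # stated in the post
FN_COST = 1.00       # assumed
UNSURE_COST = 0.20   # assumed


def standard_cost(fp, fn, unsure):
    """Cost of one run given its false positive, false negative and unsure counts."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


start = standard_cost(fp=15, fn=7, unsure=189)
end = standard_cost(fp=5, fn=6, unsure=342)
print(f"start: ${start:.2f}  end: ${end:.2f}")

# Relative drop in the Delayed-Flex cost between the two reports.
reduction = (310.05 - 212.28) / 310.05
print(f"delayed-flex reduction: {reduction:.1%}")

# Percentages from the second report, using the 16000 ham / 5800 spam corpus sizes.
fp_pct = 100.0 * 5 / 16000
fn_pct = 100.0 * 6 / 5800
unsure_pct = 100.0 * 342 / (16000 + 5800)
```

The two standard-cost totals come out at $194.80 and $124.40, matching the output above.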
At some point Tim (was it you?) reported that in other optimization
techniques it has proven to be very bad to "focus" on the persistent and
hopeless fp/fn messages. I fear the same thing might be biting me here.
I just started another optimization run, but lowered the cost of a fp
from $10 to $2, and introduced another cost function that I called
flex**2 cost because it changes the cost function for an unsure message
from a linear function to a square function. Oops, two changes at the
same time; but it takes such a long time to run....
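To make the linear versus square difference concrete, here is a minimal sketch of how the cost of one unsure ham message might look under both variants. It is only an illustration: the function names, the ramp between the cutoffs, and squaring that ramp are assumptions, not the code in CostCounter.

```python
def unsure_fraction(score, ham_cutoff, spam_cutoff):
    """How far a score sits into the unsure band: 0 at the ham cutoff, 1 at the spam cutoff."""
    if score <= ham_cutoff:
        return 0.0
    if score >= spam_cutoff:
        return 1.0
    return (score - ham_cutoff) / (spam_cutoff - ham_cutoff)


def ham_cost(score, ham_cutoff, spam_cutoff, fp_weight=2.0, power=1):
    """Cost of a ham message; power=1 is the linear flex cost, power=2 the assumed flex**2."""
    return fp_weight * unsure_fraction(score, ham_cutoff, spam_cutoff) ** power


linear = ham_cost(0.5, ham_cutoff=0.25, spam_cutoff=0.75, power=1)
squared = ham_cost(0.5, ham_cutoff=0.25, spam_cutoff=0.75, power=2)
```

With a squared ramp, a ham message in the middle of the unsure band costs half of what it does under the linear version, while a full fp still costs the full fp weight.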
More in 24 hours?
Regards,
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/