[Spambayes] Better optimization loop

Sun Nov 17 08:28:26 2002

Further changes in the optimization (not yet checked in, but I assume 
everyone is running trigrams now...)

I decided that we already have a perfect way to optimize the ham and 
spam cutoff values in timcv, so I can remove them from the simplex 
optimization. To that end I added a "delayed" flexcost to the 
CostCounter module that uses the optimal cutoffs calculated at the 
end of timcv.py. That leaves only three variables to optimize using 
simplex.
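
To illustrate what "delayed" means here, the following is a minimal 
sketch and not the actual CostCounter or timcv.py code. All function 
names, the cost weights and the toy scores are assumptions made up for 
the example. The idea is that the cost is not charged against fixed 
cutoffs while messages are scored; the scores are kept, the best cutoff 
pair is searched afterwards, and the cost is computed with that pair.

```python
# Hypothetical sketch; none of these names come from CostCounter or timcv.py.
FP_COST, FN_COST, UNSURE_COST = 10.0, 1.0, 0.2  # assumed weights


def standard_cost(ham_scores, spam_scores, ham_cutoff, spam_cutoff):
    """Cost of one run for a given pair of cutoffs."""
    cost = 0.0
    for score in ham_scores:
        if score >= spam_cutoff:
            cost += FP_COST        # ham classified as spam
        elif score >= ham_cutoff:
            cost += UNSURE_COST    # ham left in the unsure band
    for score in spam_scores:
        if score < ham_cutoff:
            cost += FN_COST        # spam classified as ham
        elif score < spam_cutoff:
            cost += UNSURE_COST    # spam left in the unsure band
    return cost


def best_cutoffs(ham_scores, spam_scores, steps=100):
    """Brute-force grid search for the cheapest (ham_cutoff, spam_cutoff)."""
    grid = [i / steps for i in range(steps + 1)]
    best = None
    for idx, ham_cutoff in enumerate(grid):
        for spam_cutoff in grid[idx:]:   # keep ham_cutoff <= spam_cutoff
            cost = standard_cost(ham_scores, spam_scores,
                                 ham_cutoff, spam_cutoff)
            if best is None or cost < best[0]:
                best = (cost, ham_cutoff, spam_cutoff)
    return best


# Toy scores, invented for the example.
ham = [0.01, 0.02, 0.10, 0.45, 0.95]
spam = [0.05, 0.60, 0.97, 0.99, 1.00]

immediate = standard_cost(ham, spam, 0.20, 0.90)        # fixed cutoffs
delayed, ham_cutoff, spam_cutoff = best_cutoffs(ham, spam)
print(immediate, delayed, ham_cutoff, spam_cutoff)
```

Because the fixed cutoffs are themselves points on the search grid, the 
delayed cost in this sketch can never be higher than the immediate one, 
which is why the cutoffs no longer need to be simplex variables.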

I then ran one optimization on my complete (16000+5800) corpus. The 
result is that it is fighting very hard to remove fp's while introducing 
lots of unsure messages:

At the start:

-> <stat> all runs false positives: 15
-> <stat> all runs false negatives: 7
-> <stat> all runs unsure: 189
Standard Cost: $194.80
Flex Cost: $607.41
Delayed-Standard Cost: $98.80
Delayed-Flex Cost: $310.05
x=0.4990 p=0.1002 s=0.4537 310.05

And near the end:

-> <stat> all runs false positives: 5
-> <stat> all runs false negatives: 6
-> <stat> all runs unsure: 342
-> <stat> all runs false positive %: 0.03125
-> <stat> all runs false negative %: 0.103448275862
-> <stat> all runs unsure %: 1.56880733945
-> <stat> all runs cost: $124.40
Standard Cost: $124.40
Flex Cost: $589.16
Delayed-Standard Cost: $98.60
Delayed-Flex Cost: $212.28
x=0.3515 p=0.2861 s=0.2467 212.28
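
The figures above can be cross-checked with a little arithmetic. This 
is only a sketch: the $10 fp cost is stated in this message, but the 
$1 fn and $0.20 unsure weights are assumptions, chosen because they 
reproduce both reported Standard Cost values. Reading the corpus as 
16000 ham + 5800 spam is likewise an inference from the reported 
percentages.

```python
FP_COST = 10.00       # stated in this message
FN_COST = 1.00        # assumption, consistent with the reported totals
UNSURE_COST = 0.20    # assumption, consistent with the reported totals


def standard_cost(fp, fn, unsure):
    """Plain weighted sum of the three error counts."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


start_cost = standard_cost(fp=15, fn=7, unsure=189)   # reported: $194.80
end_cost = standard_cost(fp=5, fn=6, unsure=342)      # reported: $124.40

n_ham, n_spam = 16000, 5800                    # inferred split of the corpus
fp_pct = 100.0 * 5 / n_ham                     # reported: 0.03125
fn_pct = 100.0 * 6 / n_spam                    # reported: 0.103448275862
unsure_pct = 100.0 * 342 / (n_ham + n_spam)    # reported: 1.56880733945

# Relative drop of the Delayed-Flex cost between the two snapshots.
flex_drop = (310.05 - 212.28) / 310.05

print("%.2f %.2f" % (start_cost, end_cost))
print("%.5f %.6f %.6f" % (fp_pct, fn_pct, unsure_pct))
print("%.3f" % flex_drop)
```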

At this stage it actually managed to get the delayed standard cost lower 
by $0.20 (it has been higher than the starting value during much of the 
optimization). The Delayed-Flex cost is lowered by about 30% (from 
$310.05 to $212.28). But look at the hugely different parameters it had 
to use! Can someone else run with these parameters and confirm that this 
is an extreme that is only warranted by my particular corpora?

Please note that a delayed flex cost this much lower actually means 
that in the unsure area there is "50% more order" than before the 
optimization!

At some point Tim (was it you?) reported that with other optimization 
techniques it proved to be very bad to "focus" on the persistent and 
hopeless fp/fn messages. I fear this might be biting me here.

I just started another optimization run, but lowered the cost of a fp 
from $10 to $2, and introduced another cost function that I called 
flex**2 cost because it changes the cost function for an unsure message 
from a linear function to a square function. Oops, two changes at the 
same time; but it takes such a long time to run....
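
For readers who want to picture the difference between the linear and 
the square cost, here is a minimal sketch. It is one plausible reading 
and not the code from CostCounter: the function names, the way a score 
is mapped to a position inside the unsure band, and the example cutoffs 
are all assumptions. Only the $2 fp cost and the linear-versus-square 
idea come from the text above.

```python
def unsure_position(score, ham_cutoff, spam_cutoff):
    """Map a score to 0.0 (at or below ham_cutoff) .. 1.0 (at or above
    spam_cutoff), linear in between. Hypothetical helper."""
    if score <= ham_cutoff:
        return 0.0
    if score >= spam_cutoff:
        return 1.0
    return (score - ham_cutoff) / (spam_cutoff - ham_cutoff)


def ham_cost(score, ham_cutoff, spam_cutoff, fp_cost=2.0, power=1):
    """Assumed cost of a ham message: nothing below ham_cutoff, the full
    fp cost above spam_cutoff, and a graded cost in the unsure band.
    power=1 gives the linear variant, power=2 the square variant."""
    return fp_cost * unsure_position(score, ham_cutoff, spam_cutoff) ** power


# A ham message sitting exactly in the middle of an assumed unsure band.
linear = ham_cost(0.50, ham_cutoff=0.25, spam_cutoff=0.75, power=1)
square = ham_cost(0.50, ham_cutoff=0.25, spam_cutoff=0.75, power=2)
print(linear, square)
```

Under these assumptions the square variant charges less than the linear 
one everywhere inside the unsure band and the same at both edges, so it 
penalizes messages near the wrong cutoff relatively more than messages 
that are merely undecided.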