[Spambayes] Better optimization loop

Rob Hooft rob@hooft.net
Fri Nov 15 21:53:47 2002

I've been playing a bit more with the weakloop concept. As Tim reported 
earlier, there is no chance that the "weak" training can be optimized 
this way. There are just too many binary choices in the training, 
resulting in a very rugged optimization landscape.

The "train automatically" mode that Tim proposed and that is much more 
stable runs way too slowly to work as a step in an optimization.

So: I'm back at timcv.py. I removed weakloop.py from CVS and added 
a new 'simplexloop.py' that takes a single option: '-c commandline'. The 
command line is then executed repeatedly with different 
bayescustomize.ini values, optimizing the cost that is reported as the 
third word of the last line of the output.
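The core of such a driver loop is small. The sketch below is illustrative only (in modern Python, not the actual simplexloop.py code); the function names and the "[Classifier]" section name are assumptions, but the protocol matches the description above: write candidate parameter values to bayescustomize.ini, run the command line, and take the third word of the last output line as the cost.

```python
import subprocess

def parse_cost(output):
    """Extract the cost: the third word of the last line of output."""
    last_line = output.strip().splitlines()[-1]
    return float(last_line.split()[2])

def write_ini(path, params):
    """Write candidate parameter values to a bayescustomize.ini file.

    The '[Classifier]' section name is an assumption for illustration.
    """
    with open(path, "w") as f:
        f.write("[Classifier]\n")
        for name, value in params.items():
            f.write("%s: %s\n" % (name, value))

def evaluate(commandline, params, ini_path="bayescustomize.ini"):
    """One trial for the simplex optimizer: write the ini, run the
    command, and return the parsed cost."""
    write_ini(ini_path, params)
    out = subprocess.run(commandline, shell=True,
                         capture_output=True, text=True).stdout
    return parse_cost(out)
```

A simplex (Nelder-Mead) optimizer then only needs evaluate() as a black-box cost function over the parameter vector.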

Obviously, I needed to change the output of timcv.py to report the 
flexcost, which I did by introducing a generic CostCounter class 
that lives in its own module.
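The actual CostCounter module may differ; the following is a minimal sketch of the idea, assuming conventional spambayes-style cost weights (a false positive is much more expensive than a false negative, and unsure messages carry a small cost) and assumed default cutoffs:

```python
class CostCounter:
    """Tally a weighted cost over classified messages (illustrative;
    weights and cutoffs here are assumptions, not the real defaults)."""

    def __init__(self, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
        self.cost = 0.0
        self.fp_weight = fp_weight
        self.fn_weight = fn_weight
        self.unsure_weight = unsure_weight

    def ham(self, score, spam_cutoff=0.9, ham_cutoff=0.2):
        """Record the cost of a ham message with the given spam score."""
        if score >= spam_cutoff:
            self.cost += self.fp_weight      # ham misclassified as spam
        elif score > ham_cutoff:
            self.cost += self.unsure_weight  # ham left in the unsure zone

    def spam(self, score, spam_cutoff=0.9, ham_cutoff=0.2):
        """Record the cost of a spam message with the given spam score."""
        if score < ham_cutoff:
            self.cost += self.fn_weight      # spam misclassified as ham
        elif score < spam_cutoff:
            self.cost += self.unsure_weight  # spam left in the unsure zone
```

Printing self.cost as the third word of timcv.py's last output line is then all the driver loop needs.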

I am currently running:

   python2.3 simplexloop.py -c 'python2.3 timcv.py -n 10 \
      --spam-keep=600 --ham-keep=600 -s 12345' > simplexloop.out

But I'm so curious about other people's results that I've already 
committed this before letting it run to completion. In the small 
test runs I did make, I learned that even this cost function has very 
sharp edges. I think this is caused by frequently occurring word 
probabilities that are either included or excluded by a small step in 
'min_prob_strength' or one of the other parameters. I think this is 
harmless if the training sets are large enough.


Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/
