[Spambayes] Better optimization loop
Fri Nov 15 21:53:47 2002
I've been playing a bit more with the weakloop concept. As Tim reported
earlier, there is no chance that the "weak" training can be optimized
this way. There are just too many binary choices in the training,
resulting in a very rough optimization landscape.
The "train automatically" mode that Tim proposed and that is much more
stable runs way too slowly to work as a step in an optimization.
So: I'm back at timcv.py. I removed weakloop.py from CVS, and added
a new 'simplexloop.py' that takes a single option: '-c commandline'. The
command line will then be repeatedly executed with different
bayescustomize.ini values, optimizing the cost that is reported as the
third word of the last line of the output.
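For readers who want the shape of that driver loop, here is a rough
modern-Python sketch (this is not the committed simplexloop.py; the
parameter names and the '[Classifier]' section name are illustrative):
write a trial bayescustomize.ini, run the command, and read the cost
back from the third word of the last line of its output. A simplex
optimizer such as scipy.optimize.fmin could then minimize this function
over the parameter vector.

```python
import subprocess

def parse_cost(output):
    # The cost is reported as the third word of the last
    # non-empty line of the command's output.
    last_line = [ln for ln in output.splitlines() if ln.strip()][-1]
    return float(last_line.split()[2])

def run_once(command, params):
    # Write the trial parameter values to bayescustomize.ini ...
    with open('bayescustomize.ini', 'w') as f:
        f.write('[Classifier]\n')
        for name, value in sorted(params.items()):
            f.write('%s: %s\n' % (name, value))
    # ... then run the scoring command and extract its reported cost.
    result = subprocess.run(command, shell=True,
                            capture_output=True, text=True)
    return parse_cost(result.stdout)
```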
Obviously, I needed to change the output of timcv.py to report the
flexcost, which I did by introducing a generic CostCounter class that
lives in its own module.
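The idea behind such a counter can be sketched as follows. This is a
simplified stand-in, not the actual CostCounter module; the $10/$1/$0.20
weights follow the costs conventionally used on the spambayes lists for
false positives, false negatives and unsures, but the interface here is
made up for illustration.

```python
class SimpleCostCounter:
    """Toy flexcost-style counter: sum dollar-weighted mistakes."""

    def __init__(self, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
        self.fp_weight = fp_weight
        self.fn_weight = fn_weight
        self.unsure_weight = unsure_weight
        self.cost = 0.0

    def add(self, is_spam, verdict):
        # verdict is one of 'spam', 'ham', 'unsure'
        if verdict == 'unsure':
            self.cost += self.unsure_weight
        elif is_spam and verdict == 'ham':
            self.cost += self.fn_weight   # missed spam
        elif not is_spam and verdict == 'spam':
            self.cost += self.fp_weight   # ham lost to the spam folder
```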
I am currently running:
python2.3 simplexloop.py -c 'python2.3 timcv.py -n 10 \
--spam-keep=600 --ham-keep=600 -s 12345' > simplexloop.out
But I'm so curious about other people's results that I've already
committed this before letting it run to completion. During the small
test runs I made, I learned that even this cost function has very
sharp edges. I think this is caused by frequently occurring word
probabilities that are either used or not used after a small step in
'min_prob_strength' or one of the other parameters. I think this is
harmless if the training sets are large enough.
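A toy illustration of those sharp edges, assuming the thresholding
works roughly like min_prob_strength (only word probabilities far
enough from the neutral 0.5 take part in scoring): when many words
share the same probability, a tiny step in the threshold drops them
all at once, so the cost jumps instead of changing smoothly. The
numbers below are invented for the illustration.

```python
# Many rare words end up sharing the same spamprob, here 0.66.
word_probs = [0.66] * 200 + [0.95] * 10 + [0.05] * 10

def words_used(min_prob_strength, probs=word_probs):
    # Only words whose probability lies at least min_prob_strength
    # away from the neutral 0.5 contribute to the score.
    return sum(abs(p - 0.5) >= min_prob_strength for p in probs)
```

Stepping the threshold from 0.15 to 0.17 discards all 200 shared-prob
words at once, a discontinuity the simplex has to cope with.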
Rob W.W. Hooft || firstname.lastname@example.org || http://www.hooft.net/people/rob/