[spambayes-dev] Incremental training results

Eli Stevens (WG.c) listsub at wickedgrey.com
Fri Jan 9 17:39:12 EST 2004

T. Alexander Popiel wrote:

> Argh.  Most of the confusion arises from a complete lack of
> documentation on the interface to the regimes: what their
> parameters mean, what the return code means, etc.  I'll try
> to get to that soon... unless someone beats me to it.  Reading
> incremental.py is pretty much required until such docs get
> written.

Somewhat tangential, but...

Last night I set up the default Data/{Ham,Spam}/SetN testing structure 
and was able to run incremental.py (with the balance_corrected regime 
added) on the lot of it.  I have 164 * 10 ham and 54 * 10 spam.  The 
spam rate has increased steadily since I started collecting - the first 
10 spam took 100 days to come in (ahh the joys of a private domain name 
and practicing safe computing!  Alas, those days are no more).  I used a 
modified version of the dotest.sh script to run each set against each 
regime, which produced 70 graphs that, while nice, don't allow for easy 
comparative analysis*.

The docs in timtest.py and timcv.py don't suggest any easy or automatic 
way to change .ini settings or regimes (I haven't gone through the code 
yet, however), but those scripts seem to be the standard for assessing 
the impact of a change to the tokenizer, etc.

I'd like to cook up something that takes a list of .ini files (or 
Option objects, if I understand correctly - are they equivalent?) and a 
list of regimes and runs all the combinations, outputting a few pretty 
graphs.  The end goal is a suite that easily tells a) what effect a 
regime change has across a range of .ini settings (or the reverse: what 
effect an .ini change has across the various regimes) and, more 
pragmatically, b) what the "best" .ini options and regime are for my 
mail stream.  We'll see how much happens this weekend.  :)
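
A minimal sketch of the "run all the combinations" idea, just to make 
the shape concrete.  Everything here is an assumption for illustration: 
run_matrix, best_combination, the run_one callable, and the "fp_rate" 
key are hypothetical names, not part of incremental.py or any existing 
spambayes script.  The actual work of loading an .ini file and driving 
a regime is left to whatever run_one wraps.

```python
import itertools


def run_matrix(ini_files, regimes, run_one):
    """Run every (ini file, regime) pair and collect the results.

    run_one is a hypothetical callable(ini_path, regime_name) that
    returns a dict of metrics for that single run.
    """
    results = {}
    # itertools.product yields each ini file paired with each regime,
    # varying the regime fastest.
    for ini_path, regime in itertools.product(ini_files, regimes):
        results[(ini_path, regime)] = run_one(ini_path, regime)
    return results


def best_combination(results, key):
    """Return the (ini, regime) pair with the lowest value for `key`."""
    return min(results, key=lambda combo: results[combo][key])


if __name__ == "__main__":
    # Placeholder runner so the sketch executes; a real one would
    # invoke the test driver and parse its output.
    def placeholder_run(ini_path, regime):
        return {"fp_rate": 0.0}

    table = run_matrix(
        ["default.ini", "experiment.ini"],      # hypothetical file names
        ["balance_corrected", "other_regime"],  # second name is made up
        placeholder_run,
    )
    for combo in sorted(table):
        print(combo, table[combo])
```

Keying the results on the (ini, regime) pair keeps both comparisons 
cheap: fix the .ini and scan regimes, or fix the regime and scan .ini 
files, without rerunning anything.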

Any suggestions, ideas for features, pointers, etc.?


[*] - Though a few spikes in the FP line did lead me to find a few spam 
in my ham corpus that I had missed previously.  ;)
