[Spambayes] statistical comparison of environment?

T. Alexander Popiel popiel at wolfskeep.com
Fri Mar 7 16:52:49 EST 2003


In message:  <3E69224D.8010103 at parducci.net>
             bill parducci <bill at parducci.net> writes:
>
>T. Alexander Popiel wrote:
>> We've actually got a pretty good testing infrastructure set up;
>> for tokenization tests, I personally use timcv.py with each of the
>> tokenization options and then feed the output of the runs into
>> table.py.  This produces some nice tabularizations that you may
>> notice in the mailing list archives.
>
>by any chance do you have an example of how this is initiated? (fyi: it
>seems that there is an issue with the command line 'help' option.)

Argh.  You're running into the same problem I did originally, due to the
testing stuff being in a subdir and the spambayes stuff not being on your
python path.  This is perhaps one of the most annoying bits about the
system.

I just checked in a fix to timcv.py which appropriately mangles the
python path before trying to import the spambayes stuff.  I don't
think this will break anybody... if it does, please tell me the proper
way to mangle the python path for an unprivileged user.  Remember,
I'm a relative python newbie, too.
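For anyone who wants to do the same thing in their own test scripts, here is a minimal sketch of this sort of path mangling. It is not necessarily what I checked in; the helper name add_parent_to_path is made up for illustration, and it assumes the script lives one directory below the one that should be importable.

```python
import os
import sys

def add_parent_to_path(script_file):
    """Put the directory above script_file's directory at the front of sys.path.

    Hypothetical helper: assumes the importable package sits one level
    above the directory holding the script.
    """
    script_dir = os.path.dirname(os.path.abspath(script_file))
    parent_dir = os.path.dirname(script_dir)
    # Only insert once, so repeated calls do not pile up duplicates.
    if parent_dir not in sys.path:
        sys.path.insert(0, parent_dir)
    return parent_dir
```

A script would call add_parent_to_path(__file__) before its first import of the package. This needs no special privileges, since it only changes the path for the running process.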


As to more general instructions:

1. Set up your corpora in subdirectories named Data/Ham/reservoir and
   Data/Spam/reservoir, with one message per file.  The splitndirs.py
   under utilities may be of help here if you're starting from mboxes,
   or es2hs.py under testtools if you're starting from an MH setup
   like mine.

2. If you're going to do any incremental testing, sort and group the
   corpora with sort+group.py.

3. Decide how many sets you want for your cross-validation.  Personally,
   I use 5.  Then use either rebal.py (from the utilities) or mksets.py
   (from testtools) to populate the sets, depending on whether or not
   you chose to sort+group... mksets.py doesn't like filenames not in the
   special format for incremental testing.

4. Set up an .ini file with whatever options you want to use as baseline.
   Set the BAYESCUSTOMIZE environment variable to that .ini file, then
   run timcv.py and capture the output.

5. Set up another .ini file with whatever options you want to test.
   Set the BAYESCUSTOMIZE environment variable to that .ini file, then
   run timcv.py again and capture the output to a different file.

6. Run table.py on the two output files from timcv.py.  Mail the results
   to the list. :-)
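
Steps 4 and 5 can be wrapped in a small driver if you do this often. What follows is only a sketch under some assumptions: the function name run_timcv and the file names in the usage note are made up, the script path will depend on where you run from, and the "-n" option for the number of sets should be checked against timcv.py's own usage message.

```python
import os
import subprocess
import sys

def run_timcv(ini_file, output_file, nsets=5, timcv="timcv.py",
              run=subprocess.run):
    """Run timcv.py with BAYESCUSTOMIZE pointing at ini_file; capture stdout.

    Assumptions: timcv.py is reachable at the given path and takes the
    number of cross-validation sets as "-n".
    """
    # Copy the current environment and override only BAYESCUSTOMIZE.
    env = dict(os.environ, BAYESCUSTOMIZE=ini_file)
    cmd = [sys.executable, timcv, "-n", str(nsets)]
    with open(output_file, "w") as out:
        # check=True raises CalledProcessError if the run fails.
        run(cmd, env=env, stdout=out, check=True)
    return cmd, env
```

With hypothetical file names, run_timcv("base.ini", "base.txt") followed by run_timcv("test.ini", "test.txt") leaves the two output files that step 6 hands to table.py.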

Enjoy.

- Alex


