[spambayes-dev] Testing Tools Changes

Sun Apr 25 19:50:31 EDT 2004

I recently created new testing corpora for myself and ran various tests.  As
part of this, I made various changes to the testing scripts to make things
easier.  I'd like to know if anyone thinks any of these are worth checking
in:

export.py (in the Outlook2000 directory): I added a command-line option to
skip printing the total number of messages that would be exported.  I didn't
really care what this number was, and generating it took a long time.
PRO: This number doesn't seem all that useful.
CON: This complicates a fairly simple script with another option.

export.py: I added a command-line option to only export messages that were
received via a certain account.  I wanted an automatic method of separating
out messages from a couple of accounts, and this seemed the easiest way.  It
compares the "Delivered" or "Envelope" header to the given regex and only
exports if it matches.  In addition, if the account is "Exchange", then it
only exports if it appears to be an Exchange message (missing those headers;
has the "X-Exchange-Junk" stuff).
PRO: This is a handy way to only get certain messages out of Outlook.
CON: This complicates the script a fair bit, and I haven't done any checking
to see how robust the Delivered/Envelope headers are (all I know is that all
my non-Exchange messages have one or the other of these).
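
For anyone who wants to picture the check, here is a minimal sketch of it.
It is not the code from export.py.  The full header names ("Delivered-To",
"Envelope-To"), the "X-Exchange-" prefix test and the function name
matches_account are all assumptions made for illustration.

```python
import re
from email import message_from_string

# Assumed header names; the description above only says "Delivered" or
# "Envelope", so adjust these to whatever your mail actually carries.
ACCOUNT_HEADERS = ("Delivered-To", "Envelope-To")


def matches_account(msg_text, account, headers=ACCOUNT_HEADERS):
    """Return True if the message looks like it arrived via `account`.

    `account` is a regex matched against the delivery headers, except for
    the special value "Exchange", which selects messages that have none of
    those headers but do have at least one X-Exchange-* header.
    """
    msg = message_from_string(msg_text)
    values = [v for name in headers for v in msg.get_all(name, [])]
    if account == "Exchange":
        # Heuristic (assumption): any header starting with "X-Exchange-".
        has_exchange = any(k.lower().startswith("x-exchange-")
                           for k in msg.keys())
        return not values and has_exchange
    pattern = re.compile(account)
    return any(pattern.search(v) for v in values)
```

A call such as matches_account(text, r"tony@") would then decide whether a
given message gets exported.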

msgstore.py (in the Outlook2000 directory): When creating the 'faked up'
Exchange headers, I added an "X-Exchange-Delivery-Time" header, which holds
the data from that Outlook property.  Without this, a lot of the exported
messages couldn't be sorted by the incremental testing stuff, so ended up at
the end, which isn't really accurate.
sort+group.py: If it can't find any received headers, check for an
"X-Exchange-Delivery-Time" header, and use that instead.
PRO: This is a very simple change, and doesn't have any effect on
classification, and improves the accuracy of incremental testing.
CON: This gets added every time that we add fake headers for an Exchange
message, and there is presumably a (very small, I think) cost involved with
that - this includes day-to-day use of the plug-in, when this has no effect
at all.
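
The fallback amounts to something like the sketch below.  It is not the
actual sort+group.py code: the function name delivery_timestamp is made up,
and it assumes the "X-Exchange-Delivery-Time" value is written as an RFC
2822 style date, which may not match what msgstore.py really emits.

```python
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz


def delivery_timestamp(msg_text):
    """Return a UTC timestamp to sort on, or None if no date is found."""
    msg = message_from_string(msg_text)
    received = msg.get_all("Received", [])
    if received:
        # The date in a Received header follows the last semicolon.
        text = received[0].rsplit(";", 1)[-1]
    else:
        # No Received headers at all: fall back to the faked-up header.
        # Assumption: its value parses as an RFC 2822 style date.
        text = msg.get("X-Exchange-Delivery-Time", "")
    text = text.strip()
    if not text:
        return None
    parsed = parsedate_tz(text)
    return mktime_tz(parsed) if parsed else None
```

Messages for which this returns None are the ones that would still end up
at the end of the sort.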

mksets.py: added -H and -S command-line options to specify an alternative
pair of directories to create the sets in, rather than being fixed to
"Data/Ham" and "Data/Spam".
PRO: This is more like the other scripts.
CON: ?
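
As a rough illustration of the option handling (a sketch only, using getopt
and ignoring whatever other options mksets.py accepts; parse_dirs is a
hypothetical helper name, not something in the script):

```python
import getopt


def parse_dirs(argv, ham_dir="Data/Ham", spam_dir="Data/Spam"):
    """Return (ham_dir, spam_dir, remaining_args) for the given argv."""
    # -H and -S each take a directory; the defaults are the fixed paths.
    opts, args = getopt.getopt(argv, "H:S:")
    for opt, value in opts:
        if opt == "-H":
            ham_dir = value
        elif opt == "-S":
            spam_dir = value
    return ham_dir, spam_dir, args
```

With no options the old "Data/Ham" and "Data/Spam" locations are used, so
existing setups would behave as before.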

incremental.py: at the moment, it uses *all* mail in Data/ - I changed it to
use the TestDriver hamdir/spamdir options only (so that you can have
multiple corpora in the Data/ directory, but test only some of it).
PRO: Makes the incremental testing more like the timcv stuff which more
people are familiar with.  Also easier to use, IMO.
CON: Changes the way the script works, so could break existing testing
setups.

fpfn.py: added a command-line flag to also print out unsures (IIRC this
script predates unsures) as well as fp and fn.
PRO: Especially when one reaches the Peters barrier and has very few fp or
fn, looking at the unsures is interesting.
CON: Complicates a very simple script (there are no command-line options at
the moment) and doesn't fit the name (but having a 'fpfnunsure.py' script
that does this seems pointless).

I also changed fpfn.py to print out each message and offer to move it to the
corresponding ham/spam set (I used it to check for misclassified messages),
but it doesn't seem like this is a good addition to the script.

I also wrote a few scripts to process the incremental.py output, using both
mkgraph.py and Excel (via COM), so that I ended up with reasonably useful
spreadsheets.  If anyone is interested in these, let me know and I'll put
them somewhere (I don't think there's any point checking them in, though).

=Tony Meyer
