[spambayes-dev] Pure Python CV comparison proggy

Eli Stevens (WG.c) listsub at wickedgrey.com
Mon Jan 12 05:20:32 EST 2004


Hey all,

I've got a multi-.ini frontend/wrapper to timcv.py attached as elicv.py
(all the diffs are needed for it to work; they are against CVS as of about
1:30am PST).  A quick summary of the diffs*:

- Options.py - changed the load_options function to take an "alternate"
parameter that defaults to None.  This allows load_options to be called
again after the module has been loaded, in order to switch to a new .ini.
I didn't notice any adverse side effects from doing this.

- timcv.py - added an "-i" option that allows a non-default, non-envvar .ini
file to be specified.  This file is loaded when options are parsed (see
above).  Also changed how main is called, so that calls like
timcv.main("-n 10 -i file.ini".split()) work.  The basic pattern is:

def main(sys_argv):
    opts, args = getopt.getopt(sys_argv, 'abc:d:')
    #etc.

if __name__ == "__main__":
    main(sys.argv[1:])

- rates.py - also added the same def main(sys_argv) convention as above and
made the module no longer just doit ;) when loaded.

- cmp.py - much the same as rates.py.
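
To make the main(sys_argv) convention above concrete, here is a minimal sketch of a module whose entry point takes the argument list without the program name, so another module can drive it directly. It is written for current Python rather than the Python of the attached script, and the names parse_args and ini_path are made up for illustration; this is not the code in the timcv.py diff.

```python
import getopt


def parse_args(argv):
    """Parse an argument list that does NOT include the program name."""
    nsets = None
    ini_path = None  # hypothetical holder for the value passed to -i
    opts, args = getopt.getopt(argv, 'n:i:')
    for opt, arg in opts:
        if opt == '-n':
            nsets = int(arg)
        elif opt == '-i':
            ini_path = arg
    return nsets, ini_path, args


def main(argv):
    """Entry point; the __main__ guard would call this as main(sys.argv[1:])."""
    nsets, ini_path, rest = parse_args(argv)
    print("nsets=%s ini=%s rest=%s" % (nsets, ini_path, rest))


# A wrapper can build the argument list itself, with no fake program name:
main("-n 10 -i file.ini".split())
```

Because the list never contains a program name, the caller and the command line agree on what index 0 means.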

The new elicv.py file takes a list of .ini files and runs all the pairwise
timcv comparisons on them in alphabetical order (i.e., for a.ini and
b.ini, a-b.txt is always the comparison file).  All of the file names that
it creates match the output of the Makefile, so the two should be
interchangeable (but see the caveats below).
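
The naming scheme described above can be sketched roughly as follows. This assumes current Python and uses itertools.combinations instead of the nested loops in elicv.py; the function name comparison_files is made up for illustration. It only computes file names and does not run anything.

```python
import itertools
import os


def comparison_files(ini_files):
    """Return (first, second, output) name triples for every pair of .ini files."""
    # Sort the file names first, then strip the extension, so the earlier
    # name in alphabetical order always comes first in "a-b.txt".
    bases = [os.path.splitext(name)[0] for name in sorted(ini_files)]
    return [(a + "s.txt", b + "s.txt", a + "-" + b + ".txt")
            for a, b in itertools.combinations(bases, 2)]


for first, second, output in comparison_files(["b.ini", "a.ini", "c.ini"]):
    print(first, second, "->", output)
```

For n .ini files this yields n*(n-1)/2 comparison files, which is worth keeping in mind before passing in a long list.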

Is there an algorithmic way to determine if a certain CV run is better than
another?  I suspect that there isn't a hard'n'fast answer (more "I'll know
it when I see it"); consider this an opportunity to educate me.  :P

I haven't touched anything in the incremental / regime testing yet.

Things I don't like/find unsettling:
- I had to touch a lot of files that aren't "mine" to get what I wanted to
do done (easily).  Due to my relative unfamiliarity with the project, I'm
not sure if this is to be expected or how it will be received.

- The timestamp detection and conditional .txt file rebuilding aspects of
the Makefile solution aren't present.  My data is relatively small and I
favor the "run everything from scratch, just to make sure" approach, but
this could end up wasting a lot of CPU time.

- The resulting .txt files end up cluttering the testtools directory.  Easy
to fix, but for now I am duplicating the output of the Makefile exactly.

- I'm not sure exactly what should go in the call to main in:

if __name__ == "__main__":
    main(sys.argv[1:])

I like [1:] because whenever I call main(sys_argv) from outside, I don't
have to cook up a fake leading program name that would most likely be
discarded.  However, it means that I have to change how the sys.argv list is
used, as I did in cmp.py:

def main(sys_argv):
    f1n, f2n = sys_argv[0:2] # Used to be sys.argv[1:3]

I'm pretty new to Python, so I am unsure if there is a standard way of doing
things like this (I'd love discussion on these kinds of things, but this
probably isn't the forum).

- Along the same lines, the usage messages of the sub-programs aren't
correct when they are called from the wrapper, due to
"""Usage: %(program)s...""" picking up the wrapper's name.

- cmp is also the name of a builtin.

- elicv.py isn't very descriptive or creative.  ;)
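
For the timestamp caveat in the list above, a Makefile-style freshness check could be approximated along these lines. This is only a sketch in current Python under the assumption that comparing modification times is good enough; needs_rebuild is a made-up name and nothing like it exists in elicv.py.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Rebuild as soon as any one source was modified after the target.
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```

A wrapper could call needs_rebuild("a.txt", ["a.ini"]) before rerunning timcv for a.ini, and skip the run when it returns False.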

I welcome comments and criticisms!

Thanks, and now to bed...
Eli

[*] - I just used "cvs diff foo.py" to generate the diffs; are there other
arguments I could use that would be easier for others to work with?

--
Give a man some mud, and he plays for a day.
Teach a man to mud, and he plays for a lifetime.
WickedGrey.com uses SpamBayes on incoming email:
               http://spambayes.sourceforge.net/
                                              --
-------------- next part --------------
#!/usr/bin/env python

"""Usage: %(program)s [options] -n nsets file1.ini [file2.ini ...]

Where:
    -h
        Show usage and exit.

    -c
        Cleans files generated from file1.ini [and file2.ini ...]
        
    fileN.ini
        An alternate .ini file to load Options from; each will be compared to
        the others.
        
Note, all other CLI arguments are passed through to timcv.py after being
checked.

    -n int
        Number of Set directories (Data/Spam/Set1, ... and Data/Ham/Set1, ...).
        This is required.

If you only want to use some of the messages in each set,

    --HamTrain int
        The maximum number of msgs to use from each Ham set for training.
        The msgs are chosen randomly.  See also the -s option.

    --SpamTrain int
        The maximum number of msgs to use from each Spam set for training.
        The msgs are chosen randomly.  See also the -s option.

    --HamTest int
        The maximum number of msgs to use from each Ham set for testing.
        The msgs are chosen randomly.  See also the -s option.

    --SpamTest int
        The maximum number of msgs to use from each Spam set for testing.
        The msgs are chosen randomly.  See also the -s option.

    --ham-keep int
        The maximum number of msgs to use from each Ham set for testing
        and training. The msgs are chosen randomly.  See also the -s option.

    --spam-keep int
        The maximum number of msgs to use from each Spam set for testing
        and training. The msgs are chosen randomly.  See also the -s option.

    -s int
        A seed for the random number generator.  Has no effect unless
        at least one of {--ham-keep, --spam-keep} is specified.  If -s
        isn't specified, the seed is taken from current time.

In addition, an attempt is made to merge bayescustomize.ini into the options.
If that exists, it can be used to change the settings in Options.options.
"""

import os
import sys
import getopt
import timcv
import rates
import cmp

program = sys.argv[0]

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def main(sys_argv):
    list_alternateOptions = []
    str_timcvSysArgv = ""
    bool_cleanAllFiles = False

    try:
        opts, args = getopt.getopt(sys_argv, 'hcn:s:',
                                   ['HamTrain=', 'SpamTrain=',
                                   'HamTest=', 'SpamTest=',
                                   'ham-keep=', 'spam-keep='])
    except getopt.error, msg:
        usage(1, msg)

    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-c':
            bool_cleanAllFiles = True
        else:
            str_timcvSysArgv += opt + " " + arg + " "

    for arg in args:
        list_alternateOptions.append(arg)


    if bool_cleanAllFiles:
        list_previousOptions = []
        list_alternateOptions.sort()
        for str_alternateOptionFile in list_alternateOptions:
            # Strip the extension (normally .ini) to get the base name.
            str_alternateBaseName = os.path.splitext(str_alternateOptionFile)[0]
            try:
                os.remove(str_alternateBaseName + ".txt")
            except OSError:
                # Nothing to clean up if the file was never generated.
                pass

            try:
                os.remove(str_alternateBaseName + "s.txt")
            except OSError:
                pass

            for str_previousOption in list_previousOptions:
                try:
                    os.remove(str_previousOption + "-" + str_alternateBaseName + ".txt")
                except OSError:
                    pass

            list_previousOptions.append(str_alternateBaseName)
        
    else:
        list_previousOptions = []
        file_originalSysStdout = sys.stdout
        list_alternateOptions.sort()
        for str_alternateOptionFile in list_alternateOptions:
            # Strip the extension (normally .ini) to get the base name.
            str_alternateBaseName = os.path.splitext(str_alternateOptionFile)[0]
            
            print >> file_originalSysStdout, "Calling timcv.py: " + str_timcvSysArgv + " -i " + str_alternateOptionFile
            sys.stdout = open(str_alternateBaseName + ".txt", 'w')
            timcv.main( (str_timcvSysArgv + " -i " + str_alternateOptionFile).split() )
            
            print >> file_originalSysStdout, "Calling rates.py: " + str_alternateBaseName + ".txt"
            #sys.stdout = open(str_alternateBaseName + "s.txt", 'w')
            sys.stdout = open("rates-junk.txt", 'w')
            rates.main((str_alternateBaseName + ".txt").split())

            for str_previousOption in list_previousOptions:
                print >> file_originalSysStdout, "Calling cmp.py: " + str_previousOption + "s.txt " + str_alternateBaseName + "s.txt"
                sys.stdout = open(str_previousOption + "-" + str_alternateBaseName + ".txt", 'w')
                cmp.main((str_previousOption + "s.txt " + str_alternateBaseName + "s.txt").split())

            list_previousOptions.append(str_alternateBaseName)
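        # Close the last redirect target and restore the real stdout before
        # cleaning up; skip this if no .ini files were processed at all.
        if sys.stdout is not file_originalSysStdout:
            sys.stdout.close()
            sys.stdout = file_originalSysStdout
            os.remove("rates-junk.txt")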

                

if __name__ == "__main__":
    main(sys.argv[1:])
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Options.py.diff
Type: application/octet-stream
Size: 435 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040112/6c980eec/Options.py.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rates.py.diff
Type: application/octet-stream
Size: 378 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040112/6c980eec/rates.py.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: timcv.py.diff
Type: application/octet-stream
Size: 759 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040112/6c980eec/timcv.py.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cmp.py.diff
Type: application/octet-stream
Size: 3788 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040112/6c980eec/cmp.py.obj

