[Spambayes-checkins] spambayes/testtools incremental.py, 1.6, 1.7 regimes.py, 1.4, 1.5
Tony Meyer
anadelonbrin at users.sourceforge.net
Sat Jan 10 20:34:39 EST 2004
Update of /cvsroot/spambayes/spambayes/testtools
In directory sc8-pr-cvs1:/tmp/cvs-serv30522/testtools
Modified Files:
incremental.py regimes.py
Log Message:
Add a docstring and the ability to print it with -h or --help to incremental.py
(interestingly, it already checked for --help and --examples, but neither did anything).
Add a docstring to regimes.py that outlines the various regimes in hopefully
easy to understand terms (based on a spambayes-dev post by Alex). Print this
out if regimes.py is executed.
Add a new regime - balanced_corrected. Details have been on spambayes-dev
and are in the docstring.
Index: incremental.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/testtools/incremental.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** incremental.py 16 Dec 2003 05:06:34 -0000 1.6
--- incremental.py 11 Jan 2004 01:34:36 -0000 1.7
***************
*** 1,2 ****
--- 1,14 ----
+ """incremental.py
+
+ This is a test harness for doing testing of incremental
+ training regimes. The individual regimes used should
+ be specified in regimes.py.
+
+ Options:
+ -h --help Display this message.
+ -r [regime] Use this regime (default: perfect).
+ -s [number] Run only this set.
+ """
+
###
### This is a test harness for doing testing of incremental
***************
*** 285,294 ****
which = None
! opts, args = getopt.getopt(sys.argv[1:], 's:r:', ['help', 'examples'])
for opt, arg in opts:
if opt == '-s':
which = int(arg) - 1
! if opt == '-r':
regime = arg
nsets = len(glob.glob("Data/Ham/Set*"))
--- 297,309 ----
which = None
! opts, args = getopt.getopt(sys.argv[1:], 'hs:r:', ['help', 'examples'])
for opt, arg in opts:
if opt == '-s':
which = int(arg) - 1
! elif opt == '-r':
regime = arg
+ elif opt == '-h' or opt == '--help':
+ print __doc__
+ sys.exit()
nsets = len(glob.glob("Data/Ham/Set*"))
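The option handling changed in this hunk can be sketched as a standalone function. This is a minimal Python 3 sketch, not the code in incremental.py: the function name `parse_options`, the `USAGE` constant, and the returned tuple are hypothetical, and the help text is returned rather than printed before `sys.exit()` so that the behaviour is easy to check. The default regime of 'perfect' is taken from the new docstring.

```python
import getopt

# Hypothetical stand-in for the module docstring printed by -h/--help.
USAGE = """Options:
    -h --help     Display this message.
    -r [regime]   Use this regime (default: perfect).
    -s [number]   Run only this set.
"""


def parse_options(argv):
    """Return (regime, which, help_text); help_text is None unless requested."""
    regime = "perfect"  # default named in the docstring
    which = None        # None means "run every set"
    # 'h' takes no argument; 's:' and 'r:' each require one.
    opts, _args = getopt.getopt(argv, "hs:r:", ["help", "examples"])
    for opt, arg in opts:
        if opt == "-s":
            which = int(arg) - 1  # sets are numbered from 1 on the command line
        elif opt == "-r":
            regime = arg
        elif opt in ("-h", "--help"):
            return regime, which, USAGE
    return regime, which, None
```

Using `elif` throughout, as the revised hunk does, means each option is matched by at most one branch.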
Index: regimes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/testtools/regimes.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** regimes.py 22 Dec 2003 23:32:53 -0000 1.4
--- regimes.py 11 Jan 2004 01:34:36 -0000 1.5
***************
*** 1,2 ****
--- 1,37 ----
+ """regimes.py
+
+ This module is not executable - it contains regime definitions
+ for use with incremental.py. Pass the name of any regime to
+ incremental.py with the "-r" switch, and it will be loaded from
+ this module.
+
+ Existing regimes are:
+ 'perfect' A train-on-everything regime. The trainer is given
+ perfect and immediate knowledge of the proper
+ classification.
+ 'corrected' A train-on-everything regime. The trainer trusts the
+ classifier result until end-of-group, at which point
+ all mistrained and non-trained items (fp, fn, and
+ unsure) are corrected to be trained with their proper
+ classification.
+ 'balanced_corrected'
+ A partial-training regime. Works just like the
+ 'corrected' regime, except that if the database is
+ imbalanced more than 2::1 (or 1::2) then messages are
+ not used for training.
+ 'expire4months' This is like 'perfect', except that messages are
+ untrained after 120 groups have passed.
+ 'nonedge' A partial-training regime, which trains only on messages
+ which are not properly classified with scores of 1.00 or
+ 0.00 (rounded). All false positives and false negatives
+ *are* trained.
+ 'fpfnunsure' A partial-training regime, which trains only on
+ false positives, false negatives and unsures.
+ 'fnunsure' A partial-training regime, which trains only on
+ false negatives and unsures. This simulates, for
+ example, a user who deletes all mail classified as spam
+ without ever examining it for false positives.
+ """
+
###
### This is a training regime for the incremental.py harness.
***************
*** 52,55 ****
--- 87,118 ----
###
### This is a training regime for the incremental.py harness.
+ ### It does guess-based training on all messages, as long
+ ### as the ham::spam ratio stays roughly even (not more than 2::1),
+ ### followed by correction to perfect at the end of each group.
+ ###
+
+ class balanced_corrected(corrected):
+     ratio_maximum = 2.0
+     def guess_action(self, which, test, guess, actual, msg):
+         # In some situations, we just do the 'corrected' regime:
+         # If we haven't trained any ham/spam (regardless of
+         # the guess because if all we know is one, everything
+         # will look like it).
+         # If the guess is unsure.
+         if not (guess[0] == 0 or test.nham_trained == 0 or \
+                 test.nspam_trained == 0):
+             # Otherwise, we only train if it doesn't screw up the
+             # balance.
+             ratio = test.nham_trained / float(test.nspam_trained)
+             if ratio > self.ratio_maximum and guess[0] == 1:
+                 # Too much ham, and this is ham - don't train.
+                 return 0
+             elif ratio < (1/self.ratio_maximum) and guess[0] == -1:
+                 # Too much spam, and this is spam - don't train.
+                 return 0
+         return corrected(self, which, test, guess, actual, msg)
+
+ ###
+ ### This is a training regime for the incremental.py harness.
### It does perfect training for fp, fn, and unsures.
###
***************
*** 130,131 ****
--- 193,197 ----
self.ham[0].append(msg)
return actual
+
+ if __name__ == "__main__":
+ print __doc__
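The balance test in the balanced_corrected regime above can be isolated as a pure function. This is a minimal Python 3 sketch under stated assumptions: the name `skip_for_balance` is hypothetical, the guess is passed as a plain integer (the diff reads `guess[0]`), and the encoding 1 = ham, -1 = spam, 0 = unsure is inferred from the comments in the diff.

```python
def skip_for_balance(guess, nham_trained, nspam_trained, ratio_maximum=2.0):
    """Return True when training on this guess would worsen the imbalance.

    guess: 1 for ham, -1 for spam, 0 for unsure (assumed encoding).
    """
    # Unsure guesses, or a database with no ham or no spam trained yet,
    # are never skipped: they fall through to the 'corrected' behaviour.
    if guess == 0 or nham_trained == 0 or nspam_trained == 0:
        return False
    ratio = nham_trained / float(nspam_trained)
    if ratio > ratio_maximum and guess == 1:
        return True   # too much ham already, and this looks like ham
    if ratio < 1.0 / ratio_maximum and guess == -1:
        return True   # too much spam already, and this looks like spam
    return False
```

With 30 ham and 10 spam trained, `skip_for_balance(1, 30, 10)` is true and `skip_for_balance(-1, 30, 10)` is false. A ratio of exactly 2.0 does not trigger a skip, because the comparison is strict. In a subclass, delegating the fall-through case to the parent would normally be written `corrected.guess_action(self, which, test, guess, actual, msg)`; the diff writes `corrected(self, ...)`, which in Python calls the class itself.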