[Spambayes-checkins] spambayes/testtools incremental.py, 1.6, 1.7 regimes.py, 1.4, 1.5

Sat Jan 10 20:34:39 EST 2004

Update of /cvsroot/spambayes/spambayes/testtools
In directory sc8-pr-cvs1:/tmp/cvs-serv30522/testtools

Modified Files:
	incremental.py regimes.py 
Log Message:
Add a docstring to incremental.py, along with the ability to print it with -h
or --help (interestingly, getopt already accepted --help and --examples, but
neither did anything).
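
For reference, the option handling after this change behaves roughly like the
sketch below. It is a standalone illustration, not the harness code itself:
parse_options() and the show_help flag are hypothetical names, and the
'perfect' default is taken from the new docstring rather than from the code.

```python
import getopt

def parse_options(argv):
    """Parse harness-style options; returns (regime, which, show_help)."""
    regime = "perfect"   # default named in the docstring (assumed here)
    which = None         # None means "run every set"
    show_help = False
    # 'h' takes no argument; 's:' and 'r:' each take one.
    opts, _args = getopt.getopt(argv, "hs:r:", ["help", "examples"])
    for opt, arg in opts:
        if opt == "-s":
            which = int(arg) - 1      # sets are numbered from 1 on the command line
        elif opt == "-r":
            regime = arg
        elif opt in ("-h", "--help"):
            show_help = True          # the real script prints __doc__ and exits
        # '--examples' is still accepted but has no handler
    return regime, which, show_help
```

Calling parse_options(sys.argv[1:]) and printing the module docstring when
show_help is set would mirror what the diff does inline.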

Add a docstring to regimes.py that outlines the various regimes in (hopefully)
easy-to-understand terms (based on a spambayes-dev post by Alex). Print this
out if regimes.py is executed.

Add a new regime - balanced_corrected.  Details have been posted to
spambayes-dev and are in the docstring.
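
As a rough illustration of the balance check (a sketch only, not the regime
class itself): balance_allows_training() and the HAM/UNSURE/SPAM constants
are hypothetical names, with the 1/0/-1 guess codes inferred from the comments
in the diff. The real method receives the guess as guess[0] and reads the
counts from the test object.

```python
HAM, UNSURE, SPAM = 1, 0, -1   # guess codes as implied by the diff's comments

def balance_allows_training(guess, nham_trained, nspam_trained,
                            ratio_maximum=2.0):
    """Return False when the balance check vetoes training on this guess.

    True means "fall through to the plain 'corrected' behaviour".
    """
    # With an unsure guess, or nothing trained yet for one class,
    # the balance check does not apply.
    if guess == UNSURE or nham_trained == 0 or nspam_trained == 0:
        return True
    ratio = nham_trained / float(nspam_trained)
    if ratio > ratio_maximum and guess == HAM:
        return False   # too much ham already, and this looks like ham
    if ratio < 1.0 / ratio_maximum and guess == SPAM:
        return False   # too much spam already, and this looks like spam
    return True
```

With 30 ham and 10 spam trained, a ham guess is skipped while a spam guess is
still trained; at exactly 2:1 nothing is skipped, since the comparison is
strict.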

Index: incremental.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/testtools/incremental.py,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** incremental.py	16 Dec 2003 05:06:34 -0000	1.6
--- incremental.py	11 Jan 2004 01:34:36 -0000	1.7
***************
*** 1,2 ****
--- 1,14 ----
+ """incremental.py
+ 
+ This is a test harness for doing testing of incremental
+ training regimes.  The individual regimes used should
+ be specified in regimes.py.
+ 
+ Options:
+   -h  --help         Display this message.
+   -r [regime]        Use this regime (default: perfect).
+   -s [number]        Run only this set.
+ """
+ 
  ###
  ### This is a test harness for doing testing of incremental
***************
*** 285,294 ****
      which = None

!     opts, args = getopt.getopt(sys.argv[1:], 's:r:', ['help', 'examples'])
      for opt, arg in opts:
          if opt == '-s':
              which = int(arg) - 1
!         if opt == '-r':
              regime = arg

      nsets = len(glob.glob("Data/Ham/Set*"))
--- 297,309 ----
      which = None

!     opts, args = getopt.getopt(sys.argv[1:], 'hs:r:', ['help', 'examples'])
      for opt, arg in opts:
          if opt == '-s':
              which = int(arg) - 1
!         elif opt == '-r':
              regime = arg
+         elif opt == '-h' or opt == '--help':
+             print __doc__
+             sys.exit()

      nsets = len(glob.glob("Data/Ham/Set*"))

Index: regimes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/testtools/regimes.py,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** regimes.py	22 Dec 2003 23:32:53 -0000	1.4
--- regimes.py	11 Jan 2004 01:34:36 -0000	1.5
***************
*** 1,2 ****
--- 1,37 ----
+ """regimes.py
+ 
+ This module is not meant to be run directly (doing so just prints this
+ message) - it contains regime definitions for use with incremental.py.
+ Pass the name of any regime to incremental.py with the "-r" switch,
+ and it will be loaded from this module.
+ 
+ Existing regimes are:
+   'perfect'       A train-on-everything regime.  The trainer is given
+                   perfect and immediate knowledge of the proper
+                   classification.
+   'corrected'     A train-on-everything regime.  The trainer trusts the
+                   classifier result until end-of-group, at which point
+                   all mistrained and non-trained items (fp, fn, and
+                   unsure) are corrected to be trained with their proper
+                   classification.
+   'balanced_corrected'
+                   A partial-training regime.  Works just like the
+                   'corrected' regime, except that if the database is
+                   imbalanced by more than 2:1 (or 1:2), messages guessed
+                   to be the over-represented class are not trained.
+   'expire4months' This is like 'perfect', except that messages are
+                   untrained after 120 groups have passed.
+   'nonedge'       A partial-training regime, which trains on every message
+                   except those properly classified with a score of 1.00 or
+                   0.00 (rounded).  All false positives and false negatives
+                   *are* trained.
+   'fpfnunsure'    A partial-training regime, which trains only on
+                   false positives, false negatives and unsures.
+   'fnunsure'      A partial-training regime, which trains only on
+                   false negatives and unsures.  This simulates, for
+                   example, a user who deletes all mail classified as spam
+                   without ever examining it for false positives.
+ """
+ 
  ###
  ### This is a training regime for the incremental.py harness.
***************
*** 52,55 ****
--- 87,118 ----
  ###
  ### This is a training regime for the incremental.py harness.
+ ### It does guess-based training on all messages, as long
+ ### as the ham:spam ratio stays roughly even (no more than 2:1
+ ### either way), followed by correction to perfect at the end of each group.
+ ###
+ 
+ class balanced_corrected(corrected):
+     ratio_maximum = 2.0
+     def guess_action(self, which, test, guess, actual, msg):
+         # In some situations, we just do the 'corrected' regime:
+         #   If we haven't trained any ham or any spam yet (regardless
+         #     of the guess, because if all we know is one class,
+         #     everything will look like it).
+         #   If the guess is unsure.
+         if not (guess[0] == 0 or test.nham_trained == 0 or \
+                 test.nspam_trained == 0):
+             # Otherwise, we only train if it doesn't screw up the
+             # balance.
+             ratio = test.nham_trained / float(test.nspam_trained)
+             if ratio > self.ratio_maximum and guess[0] == 1:
+                 # Too much ham, and this is ham - don't train.
+                 return 0
+             elif ratio < (1/self.ratio_maximum) and guess[0] == -1:
+                 # Too much spam, and this is spam - don't train.
+                 return 0
+         return corrected.guess_action(self, which, test, guess, actual, msg)
+ 
+ ###
+ ### This is a training regime for the incremental.py harness.
  ### It does perfect training for fp, fn, and unsures.
  ###
***************
*** 130,131 ****
--- 193,197 ----
              self.ham[0].append(msg)
          return actual
+ 
+ if __name__ == "__main__":
+     print __doc__