[Spambayes-checkins] spambayes Options.py,1.70,1.71 classifier.py,1.50,1.51

Tim Peters tim_one@projects.sourceforge.net
Mon Nov 18 01:40:06 2002


Update of /cvsroot/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv24664

Modified Files:
	Options.py classifier.py 
Log Message:
Added option experimental_ham_spam_imbalance_adjustment.  Please test!
Especially if you train on a lot more ham than spam (or vice versa).

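To try it, flip the option on in your customization file.  Assuming your
setup reads overrides from an ini file named in the BAYESCUSTOMIZE
environment variable, and that the option lives in the [Classifier]
section (as the cracker dict in the diff below suggests), that's:

    [Classifier]
    experimental_ham_spam_imbalance_adjustment: True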

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.70
retrieving revision 1.71
diff -C2 -d -r1.70 -r1.71
*** Options.py	13 Nov 2002 18:14:32 -0000	1.70
--- Options.py	18 Nov 2002 01:40:03 -0000	1.71
***************
*** 298,301 ****
--- 298,315 ----
  use_chi_squared_combining: True
  
+ # If the # of ham and spam in training data are out of balance, the
+ # spamprob guesses can get stronger in the direction of the category with
+ # more training msgs.  In one sense this must be so, since the more data
+ # we have of one flavor, the more we know about that flavor.  But that
+ # gives the accidental appearance of a strong word of that flavor in a msg
+ # of the other flavor much more power than an accident in the other
+ # direction.  Enable experimental_ham_spam_imbalance_adjustment if you have
+ # more ham than spam training data (or more spam than ham), and the
+ # Bayesian probability adjustment won't 'believe' raw counts more than
+ # min(# ham trained on, # spam trained on) justifies.  I *expect* this
+ # option will go away (and become the default), but people *with* strong
+ # imbalance need to test it first.
+ experimental_ham_spam_imbalance_adjustment: False
+ 
  [Hammie]
  # The name of the header that hammie adds to an E-mail in filter mode
***************
*** 410,414 ****
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
!                    },
      'Hammie': {'hammie_header_name': string_cracker,
                 'persistent_storage_file': string_cracker,
--- 424,429 ----
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
!                    'experimental_ham_spam_imbalance_adjustment': boolean_cracker,
!                   },
      'Hammie': {'hammie_header_name': string_cracker,
                 'persistent_storage_file': string_cracker,

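To make the effect concrete, here's a standalone sketch (an editor's
illustration, not checkin code) of the spamprob computation with the
adjustment toggled by a flag.  S and X mirror the unknown_word_strength
(0.45) and unknown_word_prob (0.5) defaults; the Graham-style ratio step
isn't in this hunk, so it's reconstructed from the comment's description
of the calculation, and the counts are invented:

    def spamprob(hamcount, spamcount, nham, nspam, adjust, S=0.45, X=0.5):
        # Graham-style p(word), stripped of biases and clamping.
        # Assumes the word was seen at least once, as in the real loop.
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        prob = spamratio / (hamratio + spamratio)
        if adjust:
            # Don't believe any count more than min(nham, nspam) justifies.
            spam2ham = min(float(nspam) / nham, 1.0)
            ham2spam = min(float(nham) / nspam, 1.0)
            n = hamcount * spam2ham + spamcount * ham2spam
        else:
            n = hamcount + spamcount
        # Bayesian adjustment toward the unknown-word prior X.
        return (S * X + n * prob) / (S + n)

    # 10x more ham than spam trained on; a word seen 20 times, only in ham.
    print(spamprob(20, 0, 10000, 1000, adjust=False))  # ~0.011
    print(spamprob(20, 0, 10000, 1000, adjust=True))   # ~0.092

With the option on, those 20 ham sightings carry only as much weight as 2
would have in a balanced 1000/1000 training set, so the 0.5 prior pulls
the guess back toward neutral correspondingly.
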
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.50
retrieving revision 1.51
diff -C2 -d -r1.50 -r1.51
*** classifier.py	11 Nov 2002 01:59:06 -0000	1.50
--- classifier.py	18 Nov 2002 01:40:04 -0000	1.51
***************
*** 322,330 ****
          nspam = float(self.nspam or 1)
  
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
  
          for word, record in self.wordinfo.iteritems():
!             # Compute prob(msg is spam | msg contains word).
              # This is the Graham calculation, but stripped of biases, and
              # stripped of clamping into 0.01 thru 0.99.  The Bayesian
--- 322,336 ----
          nspam = float(self.nspam or 1)
  
+         if options.experimental_ham_spam_imbalance_adjustment:
+             spam2ham = min(nspam / nham, 1.0)
+             ham2spam = min(nham / nspam, 1.0)
+         else:
+             spam2ham = ham2spam = 1.0
+ 
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
  
          for word, record in self.wordinfo.iteritems():
!             # Compute p(word) = prob(msg is spam | msg contains word).
              # This is the Graham calculation, but stripped of biases, and
              # stripped of clamping into 0.01 thru 0.99.  The Bayesian
***************
*** 358,362 ****
              # less so the larger n is, or the smaller s is.
  
!             n = hamcount + spamcount
              prob = (StimesX + n * prob) / (S + n)
  
--- 364,386 ----
              # less so the larger n is, or the smaller s is.
  
!             # Experimental:
!             # Picking a good value for n is interesting:  how much empirical
!             # evidence do we really have?  If nham == nspam,
!             # hamcount + spamcount makes a lot of sense, and the code here
!             # does that by default.
!             # But if, e.g., nham is much larger than nspam, p(w) can get a
!             # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!             # strong ham words (high hamcount) much stronger than strong
!             # spam words (high spamcount), and that makes the accidental
!             # appearance of a strong ham word in spam much more damaging than
!             # the accidental appearance of a strong spam word in ham.
!             # So we don't give hamcount full credit when nham > nspam (or
!             # spamcount when nspam > nham):  instead we knock hamcount down
!             # to what it would have been had nham been equal to nspam.  IOW,
!             # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!             # we don't "believe" any count to an extent more than
!             # min(nspam, nham) justifies.
! 
!             n = hamcount * spam2ham  +  spamcount * ham2spam
              prob = (StimesX + n * prob) / (S + n)
  

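To put numbers on the asymmetry the comment describes, take a word that
appears in every training msg of one flavor, with nham=10000 and
nspam=1000 (an editor's illustration with invented counts; S and X are
again the presumed 0.45/0.5 defaults):

    S, X = 0.45, 0.5    # unknown_word_strength, unknown_word_prob
    nham, nspam = 10000.0, 1000.0
    spam2ham = min(nspam / nham, 1.0)   # 0.1
    ham2spam = min(nham / nspam, 1.0)   # 1.0

    def adjusted(prob, n):
        # The Bayesian adjustment from the loop above.
        return (S * X + n * prob) / (S + n)

    # Unadjusted, n is the raw count:  a pure ham word (prob 0.0, in all
    # 10000 ham) lands 10x closer to 0.0 than a pure spam word (prob 1.0,
    # in all 1000 spam) can get to 1.0.
    print(adjusted(0.0, 10000))      # ~2.2e-05 away from 0.0
    print(1 - adjusted(1.0, 1000))   # ~2.2e-04 away from 1.0

    # Adjusted, hamcount is knocked down by spam2ham, and the distances
    # match again.
    print(adjusted(0.0, 10000 * spam2ham))     # ~2.2e-04 away from 0.0
    print(1 - adjusted(1.0, 1000 * ham2spam))  # ~2.2e-04 away from 1.0

The evidence a ham word can pile up is capped by nham while a spam word's
is capped by nspam, hence the 10x gap; scaling by spam2ham/ham2spam means
neither flavor is believed beyond what min(nham, nspam) justifies.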