[Spambayes-checkins] spambayes Options.py,1.57,1.58 classifier.py,1.43,1.44

Tim Peters tim_one@users.sourceforge.net
Sun Oct 27 03:43:00 2002


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22765

Modified Files:
	Options.py classifier.py 
Log Message:
Make chi-combining the default.  Add

[Classifier]
use_chi_squared_combining: False
use_gary_combining: True

if you want to use the former default for scoring.  The combining scheme
is purely a scoring-time decision.  It has no effect on training; there's
no need to retrain your database.
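
To illustrate why the choice is purely a scoring-time decision: both schemes
take the same per-word spam probabilities and differ only in how they combine
them into one score. The following is a minimal sketch, not the code in
classifier.py. The names chi2q, chi_combine and gary_combine are made up for
this illustration, the inputs are assumed to lie strictly between 0 and 1,
and the sketch omits the underflow protection a real implementation needs for
long messages.

```python
import math

def chi2q(x2, v):
    """Return P(chisq >= x2) for v degrees of freedom (v even, v > 0)."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    # Closed form for even v: exp(-m) * sum(m**i / i! for i < v/2).
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)

def chi_combine(probs):
    """Chi-squared combining (illustrative): extreme scores plus a middle ground."""
    n = len(probs)
    if n == 0:
        return 0.5
    sum_ln_p = sum(math.log(p) for p in probs)
    sum_ln_q = sum(math.log(1.0 - p) for p in probs)
    # -2*sum(ln(x_i)) follows a chi-squared distribution with 2n degrees
    # of freedom when the x_i are random and uniformly distributed.
    S = 1.0 - chi2q(-2.0 * sum_ln_q, 2 * n)   # evidence for spam
    H = 1.0 - chi2q(-2.0 * sum_ln_p, 2 * n)   # evidence for ham
    return (S - H + 1.0) / 2.0

def gary_combine(probs):
    """Gary Robinson's basic combining (illustrative), via geometric means."""
    n = len(probs)
    if n == 0:
        return 0.5
    P = 1.0 - math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    Q = 1.0 - math.exp(sum(math.log(p) for p in probs) / n)
    return P / (P + Q)

# The same word probabilities can be scored either way; no retraining.
word_probs = [0.99] * 10
chi_score = chi_combine(word_probs)
gary_score = gary_combine(word_probs)
```

Since only the combining function changes, switching schemes means rescoring
with the same trained word probabilities.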


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.57
retrieving revision 1.58
diff -C2 -d -r1.57 -r1.58
*** Options.py	26 Oct 2002 16:15:38 -0000	1.57
--- Options.py	27 Oct 2002 03:42:58 -0000	1.58
***************
*** 111,120 ****
  # ham_cutoff > spam_cutoff doesn't make sense.
  #
! # The defaults are for the all-default Robinson scheme, which makes a
! # binary decision with no middle ground.  The precise value that works
! # best is corpus-dependent, and values into the .600's have been known
! # to work best on some data.
! ham_cutoff:  0.560
! spam_cutoff: 0.560
  
  # Number of buckets in histograms.
--- 111,128 ----
  # ham_cutoff > spam_cutoff doesn't make sense.
  #
! # The defaults here (.2 and .9) may be appropriate for the default chi-
! # combining scheme.  Cutoffs for chi-combining typically aren't touchy,
! # provided you're willing to settle for "really good" instead of "optimal".
! # Tim found that .3 and .8 worked very well for well-trained systems on
! # his personal email, and his large comp.lang.python test.  If just beginning
! # training, or extremely fearful of mistakes, 0.05 and 0.95 may be more
! # appropriate for you.
! #
! # Picking good values for gary-combining is much harder, and appears to be
! # corpus-dependent, and within a single corpus dependent on how much
! # training has been done.  Values from 0.50 thru the low 0.60's have been
! # reported to work best by various testers on their data.
! ham_cutoff:  0.20
! spam_cutoff: 0.90
  
  # Number of buckets in histograms.
***************
*** 244,248 ****
  # scores (near 0.0 or 1.0), but the tail ends of the ham and spam
  # distributions overlap.
! use_gary_combining: True
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
--- 252,256 ----
  # scores (near 0.0 or 1.0), but the tail ends of the ham and spam
  # distributions overlap.
! use_gary_combining: False
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
***************
*** 263,267 ****
  # with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets
  # (original c.l.p data, his own email, and newer general python.org traffic).
! use_chi_squared_combining: False
  """
  
--- 271,275 ----
  # with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets
  # (original c.l.p data, his own email, and newer general python.org traffic).
! use_chi_squared_combining: True
  """
  

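The ham_cutoff and spam_cutoff values in the Options.py change above split
the score range into three regions. A minimal sketch of that split follows.
The function name classify and the exact treatment of scores that fall right
on a cutoff are assumptions made for illustration, not taken from the
spambayes sources.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a combined score in [0.0, 1.0] to 'ham', 'unsure' or 'spam'."""
    if ham_cutoff > spam_cutoff:
        raise ValueError("ham_cutoff > spam_cutoff doesn't make sense")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:       # boundary handling is an assumption
        return "spam"
    return "unsure"

# New defaults leave a middle ground between 0.20 and 0.90.
labels = [classify(s) for s in (0.05, 0.50, 0.95)]

# Equal cutoffs, as in the former 0.560/0.560 defaults, give a binary
# decision with no middle ground.
binary = [classify(s, 0.560, 0.560) for s in (0.55, 0.56)]
```
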
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.43
retrieving revision 1.44
diff -C2 -d -r1.43 -r1.44
*** classifier.py	26 Oct 2002 16:01:14 -0000	1.43
--- classifier.py	27 Oct 2002 03:42:58 -0000	1.44
***************
*** 9,14 ****
  # rates over Paul's original description.
  #
! # This code implements Gary Robinson's suggestions, which are well explained
! # on his webpage:
  #
  #    http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
--- 9,14 ----
  # rates over Paul's original description.
  #
! # This code implements Gary Robinson's suggestions, the core of which are
! # well explained on his webpage:
  #
  #    http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
***************
*** 19,29 ****
  # the scores under Paul's scheme were almost always very near 0 or very near
  # 1, whether or not the classification was correct.  The false positives
! # and false negatives under Gary's scheme generally score in a narrow range
! # around the corpus's best spam_cutoff value.
  #
! # The chi-combining scheme here gets closer to the theoretical basis of
! # Gary's combining scheme, and does give extreme scores, but also has a
! # very useful middle ground (small # of msgs spread across a large range
! # of scores).
  #
  # This implementation is due to Tim Peters et alia.
--- 19,31 ----
  # the scores under Paul's scheme were almost always very near 0 or very near
  # 1, whether or not the classification was correct.  The false positives
! # and false negatives under Gary's basic scheme (use_gary_combining) generally
! # score in a narrow range around the corpus's best spam_cutoff value.
! # However, it doesn't appear possible to guess the best spam_cutoff value in
! # advance, and it's touchy.
  #
! # The chi-combining scheme used by default here gets closer to the theoretical
! # basis of Gary's combining scheme, and does give extreme scores, but also
! # has a very useful middle ground (small # of msgs spread across a large range
! # of scores, and good cutoff values aren't touchy).
  #
  # This implementation is due to Tim Peters et alia.
***************
*** 34,45 ****
  
  from Options import options
! 
! if options.use_chi_squared_combining:
!     from chi2 import chi2Q
!     LN2 = math.log(2)
! 
! # The maximum number of extreme words to look at in a msg, where "extreme"
! # means with spamprob farthest away from 0.5.
! MAX_DISCRIMINATORS = options.max_discriminators # 150
  
  PICKLE_VERSION = 1
--- 36,41 ----
  
  from Options import options
! from chi2 import chi2Q
! LN2 = math.log(2)       # used frequently by chi-combining
  
  PICKLE_VERSION = 1