[Spambayes-checkins] spambayes Options.py,1.57,1.58 classifier.py,1.43,1.44
Tim Peters
tim_one@users.sourceforge.net
Sun Oct 27 03:43:00 2002
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv22765
Modified Files:
Options.py classifier.py
Log Message:
Make chi-combining the default. Add
[Classifier]
use_chi_squared_combining: False
use_gary_combining: True
if you want to use the former default for scoring. The combining scheme
is purely a scoring-time decision. It has no effect on training; there's
no need to retrain your database.
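
To illustrate why the combining scheme is purely a scoring-time choice, here
is a minimal sketch of chi-combining. It is not the code from classifier.py.
The name chi2Q is borrowed from the import in the diff, but its body here is
an assumed textbook form, and chi_combine is a hypothetical name. The real
module also does extra work involving LN2 that this sketch leaves out. The
only input is the list of per-word spam probabilities that training already
produced, so switching schemes does not touch the database.

```python
import math


def chi2Q(x2, v):
    """Return P(chisq >= x2) for v degrees of freedom; v must be even.

    Assumed closed form for even v, not copied from the chi2 module.
    """
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0.0, 1.0].

    Hypothetical helper. Each probability must lie strictly between 0 and 1.
    """
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # For random, uniformly distributed p_i, -2*sum(ln(p_i)) follows a
    # chi-squared distribution with 2*n degrees of freedom.
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)  # evidence for spam
    H = 1.0 - chi2Q(-2.0 * ln_ham, 2 * n)   # evidence for ham
    return (S - H + 1.0) / 2.0
```

Strongly spammy clues push the score toward 1.0 and strongly hammy clues push
it toward 0.0. Conflicting or neutral clues land near 0.5, which is where the
middle ground comes from.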
Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.57
retrieving revision 1.58
diff -C2 -d -r1.57 -r1.58
*** Options.py 26 Oct 2002 16:15:38 -0000 1.57
--- Options.py 27 Oct 2002 03:42:58 -0000 1.58
***************
*** 111,120 ****
# ham_cutoff > spam_cutoff doesn't make sense.
#
! # The defaults are for the all-default Robinson scheme, which makes a
! # binary decision with no middle ground. The precise value that works
! # best is corpus-dependent, and values into the .600's have been known
! # to work best on some data.
! ham_cutoff: 0.560
! spam_cutoff: 0.560
# Number of buckets in histograms.
--- 111,128 ----
# ham_cutoff > spam_cutoff doesn't make sense.
#
! # The defaults here (.2 and .9) may be appropriate for the default chi-
! # combining scheme. Cutoffs for chi-combining typically aren't touchy,
! # provided you're willing to settle for "really good" instead of "optimal".
! # Tim found that .3 and .8 worked very well for well-trained systems on
! # his personal email, and his large comp.lang.python test. If just beginning
! # training, or extremely fearful of mistakes, 0.05 and 0.95 may be more
! # appropriate for you.
! #
! # Picking good values for gary-combining is much harder, and appears to be
! # corpus-dependent, and within a single corpus dependent on how much
! # training has been done. Values from 0.50 thru the low 0.60's have been
! # reported to work best by various testers on their data.
! ham_cutoff: 0.20
! spam_cutoff: 0.90
# Number of buckets in histograms.
***************
*** 244,248 ****
# scores (near 0.0 or 1.0), but the tail ends of the ham and spam
# distributions overlap.
! use_gary_combining: True
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
--- 252,256 ----
# scores (near 0.0 or 1.0), but the tail ends of the ham and spam
# distributions overlap.
! use_gary_combining: False
# For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
***************
*** 263,267 ****
# with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets
# (original c.l.p data, his own email, and newer general python.org traffic).
! use_chi_squared_combining: False
"""
--- 271,275 ----
# with ham_cutoff=0.30 and spam_cutoff=0.80 across three test data sets
# (original c.l.p data, his own email, and newer general python.org traffic).
! use_chi_squared_combining: True
"""
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.43
retrieving revision 1.44
diff -C2 -d -r1.43 -r1.44
*** classifier.py 26 Oct 2002 16:01:14 -0000 1.43
--- classifier.py 27 Oct 2002 03:42:58 -0000 1.44
***************
*** 9,14 ****
# rates over Paul's original description.
#
! # This code implements Gary Robinson's suggestions, which are well explained
! # on his webpage:
#
# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
--- 9,14 ----
# rates over Paul's original description.
#
! # This code implements Gary Robinson's suggestions, the core of which are
! # well explained on his webpage:
#
# http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
***************
*** 19,29 ****
# the scores under Paul's scheme were almost always very near 0 or very near
# 1, whether or not the classification was correct. The false positives
! # and false negatives under Gary's scheme generally score in a narrow range
! # around the corpus's best spam_cutoff value.
#
! # The chi-combining scheme here gets closer to the theoretical basis of
! # Gary's combining scheme, and does give extreme scores, but also has a
! # very useful middle ground (small # of msgs spread across a large range
! # of scores).
#
# This implementation is due to Tim Peters et alia.
--- 19,31 ----
# the scores under Paul's scheme were almost always very near 0 or very near
# 1, whether or not the classification was correct. The false positives
! # and false negatives under Gary's basic scheme (use_gary_combining) generally
! # score in a narrow range around the corpus's best spam_cutoff value.
! # However, it doesn't appear possible to guess the best spam_cutoff value in
! # advance, and it's touchy.
#
! # The chi-combining scheme used by default here gets closer to the theoretical
! # basis of Gary's combining scheme, and does give extreme scores, but also
! # has a very useful middle ground (small # of msgs spread across a large range
! # of scores, and good cutoff values aren't touchy).
#
# This implementation is due to Tim Peters et alia.
***************
*** 34,45 ****
from Options import options
!
! if options.use_chi_squared_combining:
!     from chi2 import chi2Q
!     LN2 = math.log(2)
!
! # The maximum number of extreme words to look at in a msg, where "extreme"
! # means with spamprob farthest away from 0.5.
! MAX_DISCRIMINATORS = options.max_discriminators # 150
PICKLE_VERSION = 1
--- 36,41 ----
from Options import options
! from chi2 import chi2Q
! LN2 = math.log(2) # used frequently by chi-combining
PICKLE_VERSION = 1
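
For readers adjusting the cutoffs described in the Options.py comments above,
the following sketch shows one way a score could be turned into a three-way
decision. The function name classify and the exact boundary handling (strictly
below ham_cutoff is ham, at or above spam_cutoff is spam) are assumptions made
for illustration, not taken from the spambayes sources. The default values are
the new ones from this check-in.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a score in [0.0, 1.0] to "ham", "unsure", or "spam".

    Hypothetical helper; the boundary conventions are assumed.
    """
    if ham_cutoff > spam_cutoff:
        raise ValueError("ham_cutoff > spam_cutoff doesn't make sense")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    # Anything between the two cutoffs is the middle ground.
    return "unsure"
```

With both cutoffs set to the same value, as in the former defaults of 0.560,
the "unsure" branch is never reached and the decision is binary.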