[Spambayes-checkins] spambayes Options.py,1.67,1.68 classifier.py,1.49,1.50 weakloop.py,1.1,1.2

Tim Peters tim_one@users.sourceforge.net
Mon Nov 11 01:59:08 2002


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv5402

Modified Files:
	Options.py classifier.py weakloop.py 
Log Message:
For the benefit of future generations, renamed some options:

Old                             New
---                             ---
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength
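Under the new names, a [Classifier] options section (using the same default values the code below carries) looks like:

```ini
[Classifier]
unknown_word_prob: 0.5
unknown_word_strength: 0.45
minimum_prob_strength: 0.1
```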


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** Options.py	8 Nov 2002 04:06:23 -0000	1.67
--- Options.py	11 Nov 2002 01:59:06 -0000	1.68
***************
*** 241,268 ****
  
  # These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that has never been
! # seen before.  Nobody has reported an improvement via moving it away
! # from 1/2.
! # "s" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting.  At s=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that has appeared in only ham or only spam.  This is a disaster.
! # As s tends toward infinity, all probabilities tend toward x.  All
! # reports were that a value near 0.4 worked best, so this does not seem to
! # be corpus-dependent.
! # NOTE:  Gary Robinson previously used a different formula involving 'a'
! # and 'x'.  The 'x' here is the same as before.  The 's' here is the old
! # 'a' divided by 'x'.
! robinson_probability_x: 0.5
! robinson_probability_s: 0.45
  
  # When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
  # This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinson's base scheme.  0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
  
! # The combining scheme currently detailed on Gary Robinson's web page.
  # The middle ground here is touchy, varying across corpus, and within
  # a corpus across amounts of training data.  It almost never gives extreme
--- 241,268 ----
  
  # These two control the prior assumption about word probabilities.
! # unknown_word_prob is essentially the probability given to a word that
! # has never been seen before.  Nobody has reported an improvement via moving
! # it away from 1/2, although Tim has measured a mean spamprob of a bit over
! # 0.5 (0.51-0.55) in 3 well-trained classifiers.
! #
! # unknown_word_strength adjusts how much weight to give the prior assumption
! # relative to the probabilities estimated by counting.  At 0, the counting
! # estimates are believed 100%, even to the extent of assigning certainty
! # (0 or 1) to a word that has appeared in only ham or only spam.  This
! # is a disaster.
! #
! # As unknown_word_strength tends toward infinity, all probabilities tend
! # toward unknown_word_prob.  All reports were that a value near 0.4 worked
! # best, so this does not seem to be corpus-dependent.
! unknown_word_prob: 0.5
! unknown_word_strength: 0.45
  
  # When scoring a message, ignore all words with
! # abs(word.spamprob - 0.5) < minimum_prob_strength.
  # This may be a hack, but it has proved to reduce error rates in many
! # tests.  0.1 appeared to work well across all corpora.
! minimum_prob_strength: 0.1
  
! # The combining scheme currently detailed on Gary Robinson's web page.
  # The middle ground here is touchy, varying across corpus, and within
  # a corpus across amounts of training data.  It almost never gives extreme
***************
*** 272,284 ****
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom.  That is
! # the "provably most-sensitive" test Gary's original scheme was monotonic
  # with.  Getting closer to the theoretical basis appears to give an excellent
  # combining method, usually very extreme in its judgment, yet finding a tiny
  # (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live.  This is the best method so far on Tims data.
! # One systematic benefit is that it is immune to "cancellation disease".  One
! # systematic drawback is that it is sensitive to *any* deviation from a
! # uniform distribution, regardless of whether that is actually evidence of
  # ham or spam.  Rob Hooft alleviated that by combining the final S and H
  # measures via (S-H+1)/2 instead of via S/(S+H).
--- 272,284 ----
  
  # For vectors of random, uniformly distributed probabilities, -2*sum(ln(p_i))
! # follows the chi-squared distribution with 2*n degrees of freedom.  This is
! # the "provably most-sensitive" test the original scheme was monotonic
  # with.  Getting closer to the theoretical basis appears to give an excellent
  # combining method, usually very extreme in its judgment, yet finding a tiny
  # (in # of msgs, spread across a huge range of scores) middle ground where
! # lots of the mistakes live.  This is the best method so far.
! # One systematic benefit is immunity to "cancellation disease".  One
! # systematic drawback is sensitivity to *any* deviation from a uniform
! # distribution, regardless of whether the deviation is actually evidence of
  # ham or spam.  Rob Hooft alleviated that by combining the final S and H
  # measures via (S-H+1)/2 instead of via S/(S+H).
***************
*** 381,387 ****
                   },
      'Classifier': {'max_discriminators': int_cracker,
!                    'robinson_probability_x': float_cracker,
!                    'robinson_probability_s': float_cracker,
!                    'robinson_minimum_prob_strength': float_cracker,
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
--- 381,387 ----
                   },
      'Classifier': {'max_discriminators': int_cracker,
!                    'unknown_word_prob': float_cracker,
!                    'unknown_word_strength': float_cracker,
!                    'minimum_prob_strength': float_cracker,
                     'use_gary_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
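The prior smoothing these two options control can be sketched as below.  This is an illustrative reconstruction of the behavior the comments describe (s = unknown_word_strength, x = unknown_word_prob), not a verbatim copy of the expression in classifier.py:

```python
def smoothed_spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Illustrative sketch: blend the counting estimate with the prior.

    With no evidence (n == 0) the result is exactly x; as evidence
    accumulates, it converges on the raw counting estimate p.
    """
    n = hamcount + spamcount
    if n == 0:
        return x
    hamratio = hamcount / nham      # fraction of ham containing the word
    spamratio = spamcount / nspam   # fraction of spam containing the word
    p = spamratio / (hamratio + spamratio)  # raw counting estimate
    return (s * x + n * p) / (s + n)
```

Note that at s=0 a word seen only in spam gets probability exactly 1.0 (the "disaster" above), while any s > 0 keeps it strictly between x and 1.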

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** classifier.py	7 Nov 2002 22:30:05 -0000	1.49
--- classifier.py	11 Nov 2002 01:59:06 -0000	1.50
***************
*** 70,74 ****
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.robinson_probability_x):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
--- 70,74 ----
      # a word is no longer being used, it's just wasting space.
  
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
          self.atime = atime
          self.spamcount = self.hamcount = self.killcount = 0
***************
*** 322,327 ****
          nspam = float(self.nspam or 1)
  
!         S = options.robinson_probability_s
!         StimesX = S * options.robinson_probability_x
  
          for word, record in self.wordinfo.iteritems():
--- 322,327 ----
          nspam = float(self.nspam or 1)
  
!         S = options.unknown_word_strength
!         StimesX = S * options.unknown_word_prob
  
          for word, record in self.wordinfo.iteritems():
***************
*** 449,454 ****
  
      def _getclues(self, wordstream):
!         mindist = options.robinson_minimum_prob_strength
!         unknown = options.robinson_probability_x
  
          clues = []  # (distance, prob, word, record) tuples
--- 449,454 ----
  
      def _getclues(self, wordstream):
!         mindist = options.minimum_prob_strength
!         unknown = options.unknown_word_prob
  
          clues = []  # (distance, prob, word, record) tuples
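The filtering _getclues performs with these options can be sketched as a hypothetical standalone function (not the method itself, which also consults per-word records):

```python
def strong_clues(word_probs, mindist=0.1, unknown=0.5):
    """Hypothetical sketch of clue selection.

    Words whose spamprob lies within mindist (minimum_prob_strength)
    of the neutral unknown_word_prob are ignored; the rest are ranked
    by distance so the strongest evidence sorts first.
    """
    clues = []  # (distance, prob, word) tuples
    for word, prob in word_probs:
        distance = abs(prob - unknown)
        if distance >= mindist:
            clues.append((distance, prob, word))
    clues.sort(reverse=True)  # strongest evidence first
    return clues
```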

Index: weakloop.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/weakloop.py,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** weakloop.py	10 Nov 2002 12:08:40 -0000	1.1
--- weakloop.py	11 Nov 2002 01:59:06 -0000	1.2
***************
*** 29,35 ****
  default="""
  [Classifier]
! robinson_probability_x = 0.5
! robinson_minimum_prob_strength = 0.1
! robinson_probability_s = 0.45
  max_discriminators = 150
  
--- 29,35 ----
  default="""
  [Classifier]
! unknown_word_prob = 0.5
! minimum_prob_strength = 0.1
! unknown_word_strength = 0.45
  max_discriminators = 150
  
***************
*** 41,47 ****
  import Options
  
! start = (Options.options.robinson_probability_x,
!          Options.options.robinson_minimum_prob_strength,
!          Options.options.robinson_probability_s,
           Options.options.spam_cutoff,
           Options.options.ham_cutoff)
--- 41,47 ----
  import Options
  
! start = (Options.options.unknown_word_prob,
!          Options.options.minimum_prob_strength,
!          Options.options.unknown_word_strength,
           Options.options.spam_cutoff,
           Options.options.ham_cutoff)
***************
*** 52,58 ****
      f.write("""
  [Classifier]
! robinson_probability_x = %.6f
! robinson_minimum_prob_strength = %.6f
! robinson_probability_s = %.6f
  
  [TestDriver]
--- 52,58 ----
      f.write("""
  [Classifier]
! unknown_word_prob = %.6f
! minimum_prob_strength = %.6f
! unknown_word_strength = %.6f
  
  [TestDriver]




