[Spambayes-checkins] spambayes Options.py,1.49,1.50 README.txt,1.37,1.38 TestDriver.py,1.25,1.26 classifier.py,1.38,1.39 clgen.py,1.1,NONE clpik.py,1.1,NONE rmspik.py,1.4,NONE

Tim Peters tim_one@users.sourceforge.net
Thu, 17 Oct 2002 22:44:07 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv25258

Modified Files:
	Options.py README.txt TestDriver.py classifier.py 
Removed Files:
	clgen.py clpik.py rmspik.py 
Log Message:
Removed 4 combining schemes:

    use_central_limit
    use_central_limit2
    use_central_limit3
    use_z_combining

The central limit schemes aimed at getting a useful middle ground, but
chi-combining has proved to work better for that.  The chi scheme doesn't
require the troublesome "third training pass" either.  z-combining was
more like chi-combining, and worked well, but not as well as chi-
combining; z-combining proved vulnerable to "cancellation disease", to
which chi-combining seems all but immune.
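
For readers who never ran it, here is a rough sketch of the z-combining
formula as it appears in the removed z_spamprob.  It assumes normIP and
normP were the inverse CDF and CDF of the unit normal; the stdlib
statistics.NormalDist (Python 3.8+) stands in for them, and z_combine is
a hypothetical name, not a function from the codebase.

```python
from math import sqrt
from statistics import NormalDist

_unit = NormalDist()    # mean 0, sdev 1; stand-in for chi2.normP / normIP


def z_combine(probs):
    """Combine per-word spamprobs the way the removed z_spamprob did."""
    n = len(probs)
    if not n:
        return 0.5
    # Turn each probability into a z-score and add them up.
    zsum = sum(_unit.inv_cdf(p) for p in probs)
    # The mean of n unit-normal z-scores has sdev 1/sqrt(n), so the
    # z-score of zsum/n is zsum/sqrt(n).
    return _unit.cdf(zsum / sqrt(n))


# Ten strong spam clues against ten equally strong ham clues cancel out.
balanced = z_combine([0.99] * 10 + [0.01] * 10)
# Add four more ham clues: most of the evidence still conflicts, yet the
# leftover few clues drag the score close to 0.
lopsided = z_combine([0.99] * 10 + [0.01] * 14)
```

Under this sketch the lopsided score is driven by the handful of clues left
after the rest cancel, which is one reading of the "cancellation disease"
mentioned above.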

Removed supporting option zscore_ratio_cutoff.

Removed various data attributes of class Bayes, unique to the central
limit schemes.  __getstate__ and __setstate__ had never been
updated to save or restore them, so old pickles will still work fine.
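
A minimal sketch of why old pickles are unaffected, assuming the state
tuple has the shape the __setstate__ line in the diff suggests (a version
tag followed by wordinfo, nspam, nham).  PICKLE_VERSION and the toy class
body are hypothetical, and the state methods are called directly to mimic
what pickle does on dump and load.

```python
PICKLE_VERSION = 1      # hypothetical version tag


class Bayes:
    def __init__(self):
        self.wordinfo = {}
        self.nspam = self.nham = 0
        self.spamsum = 0    # stands in for a central-limit-only attribute

    def __getstate__(self):
        # Only these three attributes ever reach the pickle.
        return PICKLE_VERSION, self.wordinfo, self.nspam, self.nham

    def __setstate__(self, t):
        if t[0] != PICKLE_VERSION:
            raise ValueError("can't unpickle version %r" % (t[0],))
        self.wordinfo, self.nspam, self.nham = t[1:]


old = Bayes()
old.nspam, old.nham, old.spamsum = 3, 5, 12345

state = old.__getstate__()          # what pickle would write
restored = Bayes.__new__(Bayes)     # unpickling does not call __init__
restored.__setstate__(state)        # what pickle would do on load
```

Since the central-limit attributes were never part of the saved state,
dropping them from the class changes nothing about what is written or read.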

Removed method Bayes.compute_population_stats(), which constituted
"the third training pass" unique to the central limit schemes.  There's
scant chance this will ever be needed again, since it was never clear
how to make the 3-pass schemes practical over time.

Gave the still-default combining scheme's method the name gary_spamprob,
and made spamprob an alias for that by default.  This lets you name
each combining scheme explicitly in case you want to test using more
than one (the others are named tim_spamprob and chi2_spamprob).
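
The aliasing pattern can be sketched as follows.  This is a toy class, not
the real classifier: the method bodies are placeholders, and USE_CHI is a
hypothetical stand-in for options.use_chi_squared_combining.

```python
class Bayes:
    """Toy stand-in showing the class-body aliasing pattern."""

    def gary_spamprob(self, wordstream):
        return "gary"               # placeholder for the real computation

    spamprob = gary_spamprob        # default; may be replaced later

    def chi2_spamprob(self, wordstream):
        return "chi2"               # placeholder for the real computation

    USE_CHI = False                 # hypothetical option flag
    if USE_CHI:
        spamprob = chi2_spamprob    # rebinding happens at class-definition time


b = Bayes()
default_result = b.spamprob([])         # dispatches to gary_spamprob here
explicit_result = b.chi2_spamprob([])   # still reachable by its own name
```

Whichever scheme spamprob ends up bound to, each scheme remains callable
under its explicit name, which is what makes side-by-side tests possible.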

In gary_spamprob, simplified the scaling of (P-Q)/(P+Q) into 0 .. 1,
replacing the whole shebang with P/(P+Q).  Same result, but a little
faster.
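
The identity behind that simplification can be checked numerically.  This
is only a sketch: old_scale and new_scale are hypothetical names, and the
P, Q pairs are arbitrary stand-in values rather than ones computed from a
message.

```python
import math


def old_scale(P, Q):
    """Map (P-Q)/(P+Q) from -1 .. 1 into 0 .. 1, as the old code did."""
    prob = (P - Q) / (P + Q)    # in -1 .. 1
    return 0.5 + prob / 2       # shift to 0 .. 1


def new_scale(P, Q):
    """Same value: ((P-Q)/(P+Q) + 1)/2 = (2*P/(P+Q))/2 = P/(P+Q)."""
    return P / (P + Q)


pairs = [(0.9, 0.1), (0.25, 0.75), (0.5, 0.5), (0.3, 0.6)]
agree = all(
    math.isclose(old_scale(P, Q), new_scale(P, Q), abs_tol=1e-12)
    for P, Q in pairs
)
```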

Removed files clgen.py, clpik.py, and rmspik.py.  These were data
generation and analysis tools unique to the central limit schemes.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.49
retrieving revision 1.50
diff -C2 -d -r1.49 -r1.50
*** Options.py	17 Oct 2002 06:23:13 -0000	1.49
--- Options.py	18 Oct 2002 05:44:04 -0000	1.50
***************
*** 199,209 ****
  # on.  By default, it does this in a clever way, learning *and* unlearning
  # sets as it goes along, so that it never needs to train on N-1 sets in one
! # gulp after the first time.  However, that can't always be done:  in
! # particular, the central-limit schemes can't unlearn incrementally, and can
! # learn incrementally only via a form of cheating whose bad effects overall
! # aren't yet known.
! # So when desiring to run a central-limit test, set
! # build_each_classifier_from_scratch to true.  This gives correct results,
! # but runs much slower than a CV driver usually runs.
  build_each_classifier_from_scratch: False
  
--- 199,205 ----
  # on.  By default, it does this in a clever way, learning *and* unlearning
  # sets as it goes along, so that it never needs to train on N-1 sets in one
! # gulp after the first time.  Setting this option true forces "one gulp
! # from-scratch" training every time.  There used to be a set of combining
! # schemes that needed this, but now it's just in case you're paranoid <wink>.
  build_each_classifier_from_scratch: False
  
***************
*** 238,253 ****
  robinson_minimum_prob_strength: 0.1
  
! ###########################################################################
! # Speculative options for Gary Robinson's central-limit ideas.  These may go
! # away, or a bunch of incompatible stuff above may go away.
! 
! # For the default scheme, use "tim-combining" of probabilities.  This has
! # no effect under the central-limit schemes.  Tim-combining is a kind of
! # cross between Paul Graham's and Gary Robinson's combining schemes.  Unlike
! # Paul's, it's never crazy-certain, and compared to Gary's, in Tim's tests it
! # greatly increased the spread between mean ham-scores and spam-scores, while
! # simultaneously decreasing the variance of both.  Tim needed a higher
! # spam_cutoff value for best results, but spam_cutoff is less touchy
! # than under Gary-combining.
  use_tim_combining: False
  
--- 234,244 ----
  robinson_minimum_prob_strength: 0.1
  
! # For the default scheme, use "tim-combining" of probabilities.  Tim-
! # combining is a kind of cross between Paul Graham's and Gary Robinson's
! # combining schemes.  Unlike Paul's, it's never crazy-certain, and compared
! # to Gary's, in Tim's tests it greatly increased the spread between mean
! # ham-scores and spam-scores, while simultaneously decreasing the variance
! # of both.  Tim needed a higher spam_cutoff value for best results, but
! # spam_cutoff is less touchy than under Gary-combining.
  use_tim_combining: False
  
***************
*** 262,300 ****
  # systematic drawback is that it's sensitive to *any* deviation from a
  # uniform distribution, regardless of whether that's actually evidence of
! # ham or spam.  Rob Hooft may have a pragmatic cure for that (combine the
! # final S and H measures via (S-H+1)/2 instead of via S/(S+H)).
  use_chi_squared_combining: False
- 
- # z_combining is a scheme Gary has discussed with me offline.  I'll say more
- # if it proves promising.  In initial tests it was even more extreme than
- # chi combining, but not always in a good way -- in particular, it appears
- # as vulnerable to "cancellation disease" as Graham-combining, giving one
- # spam in my corpus a score of 4.1e-14 (chi combining scored it 0.5).
- use_z_combining: False
- 
- # Use a central-limit approach for scoring.
- # The number of extremes to use is given by max_discriminators (above).
- # spam_cutoff should almost certainly be exactly 0.5 when using this approach.
- # DO NOT run cross-validation tests when this is enabled!  They'll deliver
- # nonense, or, if you're lucky, will blow up with division by 0 or negative
- # square roots.  An NxN test grid should work fine.
- use_central_limit: False
- 
- # Same as use_central_limit, except takes logarithms of probabilities and
- # probability complements (p and 1-p) instead.
- use_central_limit2: False
- use_central_limit3: False
- 
- # For now, a central-limit scheme considers its decision "certain" if the
- # ratio of the zscore with larger magnitude to the zscore with smaller
- # magnitude exceeds zscore_ratio_cutoff.  The value here is seat-of-the-
- # pants for use_central_limit2; nothing is known about use_central_limit wrt
- # this.
- # For now, a central-limit scheme delivers just one of 4 scores:
- # 0.00  -- certain it's ham
- # 0.49  -- guesses ham but is unsure
- # 0.51  -- guesses spam but is unsure
- # 1.00  -- certain it's spam
- zscore_ratio_cutoff: 1.9
  """
  
--- 253,262 ----
  # systematic drawback is that it's sensitive to *any* deviation from a
  # uniform distribution, regardless of whether that's actually evidence of
! # ham or spam.  Rob Hooft alleviated that by combining the final S and H
! # measures via (S-H+1)/2 instead of via S/(S+H).
! # In practice, it appears that setting ham_cutoff=0.05, and spam_cutoff=0.95,
! # does well across test sets; while these cutoffs are rarely optimal, they
! # get close to optimal.
  use_chi_squared_combining: False
  """
  
***************
*** 346,358 ****
                     'robinson_probability_s': float_cracker,
                     'robinson_minimum_prob_strength': float_cracker,
- 
-                    'use_central_limit': boolean_cracker,
-                    'use_central_limit2': boolean_cracker,
-                    'use_central_limit3': boolean_cracker,
-                    'zscore_ratio_cutoff': float_cracker,
- 
                     'use_tim_combining': boolean_cracker,
                     'use_chi_squared_combining': boolean_cracker,
-                    'use_z_combining': boolean_cracker,
                     },
  }
--- 308,313 ----

Index: README.txt
===================================================================
RCS file: /cvsroot/spambayes/spambayes/README.txt,v
retrieving revision 1.37
retrieving revision 1.38
diff -C2 -d -r1.37 -r1.38
*** README.txt	13 Oct 2002 19:25:42 -0000	1.37
--- README.txt	18 Oct 2002 05:44:04 -0000	1.38
***************
*** 118,125 ****
          remaining set (the set not used to train the classifier).
      mboxtest does the same.
-     timcv should not be used for central limit tests (timcv does
-         incremental learning and unlearning, for efficiency; the central
-         limit schemes can't unlearn incrementally, and their incremental
-         learning ability is a cheat whose badness isn't yet known).
      This (or mboxtest) is the preferred way to test when possible:  it
          makes best use of limited data, and interpreting results is
--- 118,121 ----
***************
*** 140,144 ****
          because each msg is predicted against N-1 times overall.  So, e.g.,
          one terribly difficult spam or ham can count against you N-1 times.
-     Central limit tests are fine with timtest.
  
  
--- 136,139 ----
***************
*** 205,227 ****
  Experimental Files
  ==================
- clgen.py
-     A test driver only for use with one of the speculative central-limit
-     schemes.  Its purpose is to generate a binary pickle containing
-     internal information about every prediction made.  This will go
-     away someday.
- 
- clpik.py
-     An example analysis program showing how to access the pickles
-     produced by clgen.py, and how to generate potentially interesting
-     histograms from them.
- 
- rmspik.py
-     A program that analyzes a clgen-produced pickle, and tells you what
-     would happen if we had used Rob Hooft's "RMS ZScore" scheme for
-     deciding certainty instead.
-     CAUTION:  This doesn't work as intended for plain use_central_limit.
-     The chance() function seems to make an assumption that's true
-     only under use_central_limit2 and use_central_limit3.
- 
  cvcost.py
      A program that analyzes the output of timcv.py (the final histograms)
--- 200,203 ----
***************
*** 230,233 ****
--- 206,210 ----
      pseudo-realistic costs to handle a fp, a fn and to handle a message
      in the grey zone.
+ 
  
  Standard Test Data Setup

Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** TestDriver.py	18 Oct 2002 05:04:46 -0000	1.25
--- TestDriver.py	18 Oct 2002 05:44:05 -0000	1.26
***************
*** 141,152 ****
          self.trained_spam_hist = Hist()
  
-     # CAUTION:  When options.use_central_limit{,2,3} is in effect, this
-     # adds the new population statistics to the existing population statistics
-     # (if any), but the existing population statistics are no longer correct
-     # due to the new data we just added (which can change spamprobs, and
-     # even the *set* of extreme words).  There's no thoroughly correct way
-     # to repair this short of recomputing the population statistics for
-     # every msg *ever* trained on.  It's currently unknown how badly this
-     # cheat may affect results.
      def train(self, ham, spam):
          print "-> Training on", ham, "&", spam, "...",
--- 141,144 ----
***************
*** 155,163 ****
          self.tester.train(ham, spam)
          print c.nham - nham, "hams &", c.nspam- nspam, "spams"
-         c.compute_population_stats(ham, False)
-         c.compute_population_stats(spam, True)
  
-     # CAUTION:  this doesn't work at all for incrememental training when
-     # options.use_central_limit{,2,3} is in effect.
      def untrain(self, ham, spam):
          print "-> Forgetting", ham, "&", spam, "...",
--- 147,151 ----

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.38
retrieving revision 1.39
diff -C2 -d -r1.38 -r1.39
*** classifier.py	14 Oct 2002 02:20:35 -0000	1.38
--- classifier.py	18 Oct 2002 05:44:05 -0000	1.39
***************
*** 34,40 ****
      LN2 = math.log(2)
  
- if options.use_z_combining:
-     from chi2 import normP, normIP
- 
  # The maximum number of extreme words to look at in a msg, where "extreme"
  # means with spamprob farthest away from 0.5.
--- 34,37 ----
***************
*** 86,113 ****
                   'nspam',     # number of spam messages learn() has seen
                   'nham',      # number of non-spam messages learn() has seen
- 
-                  # The rest is unique to the central-limit code.
-                  # n is the # of data points in the population.
-                  # sum is the sum of the probabilities, and is a long scaled
-                  # by 2**64.
-                  # sumsq is the sum of the squares of the probabilities, and
-                  # is a long scaled by 2**128.
-                  # mean is the mean probability of the population, as an
-                  # unscaled float.
-                  # var is the variance of the population, as unscaled float.
-                  # There's one set of these for the spam population, and
-                  # another for the ham population.
-                  # XXX If this code survives, clean it up.
-                  'spamn',
-                  'spamsum',
-                  'spamsumsq',
-                  'spammean',
-                  'spamvar',
- 
-                  'hamn',
-                  'hamsum',
-                  'hamsumsq',
-                  'hammean',
-                  'hamvar',
                  )
  
--- 83,86 ----
***************
*** 115,121 ****
          self.wordinfo = {}
          self.nspam = self.nham = 0
-         self.spamn = self.hamn = 0
-         self.spamsum = self.spamsumsq = 0
-         self.hamsum = self.hamsumsq = 0
  
      def __getstate__(self):
--- 88,91 ----
***************
*** 127,131 ****
          self.wordinfo, self.nspam, self.nham = t[1:]
  
!     def spamprob(self, wordstream, evidence=False):
          """Return best-guess probability that wordstream is spam.
  
--- 97,101 ----
          self.wordinfo, self.nspam, self.nham = t[1:]
  
!     def gary_spamprob(self, wordstream, evidence=False):
          """Return best-guess probability that wordstream is spam.
  
***************
*** 180,185 ****
              Q = 1.0 - Q**n * 2.0**(Qexp * n)
  
!             prob = (P-Q)/(P+Q)  # in -1 .. 1
!             prob = 0.5 + prob/2 # shift to 0 .. 1
          else:
              prob = 0.5
--- 150,159 ----
              Q = 1.0 - Q**n * 2.0**(Qexp * n)
  
!             # (P-Q)/(P+Q) is in -1 .. 1; scaling into 0 .. 1 gives
!             # ((P-Q)/(P+Q)+1)/2 =
!             # ((P-Q+P+Q)/(P+Q))/2 =
!             # (2*P/(P+Q))/2 =
!             # P/(P+Q)
!             prob = P/(P+Q)
          else:
              prob = 0.5
***************
*** 192,195 ****
--- 166,171 ----
              return prob
  
+     spamprob = gary_spamprob    # may be replaced later
+ 
      def learn(self, wordstream, is_spam, update_probabilities=True):
          """Teach the classifier by example.
***************
*** 357,363 ****
                      del self.wordinfo[word]
  
-     def compute_population_stats(self, msgstream, is_spam):
-         pass
- 
      def _getclues(self, wordstream):
          mindist = options.robinson_minimum_prob_strength
--- 333,336 ----
***************
*** 544,803 ****
      if options.use_chi_squared_combining:
          spamprob = chi2_spamprob
- 
-     def z_spamprob(self, wordstream, evidence=False):
-         """Return best-guess probability that wordstream is spam.
- 
-         wordstream is an iterable object producing words.
-         The return value is a float in [0.0, 1.0].
- 
-         If optional arg evidence is True, the return value is a pair
-             probability, evidence
-         where evidence is a list of (word, probability) pairs.
-         """
- 
-         from math import sqrt
- 
-         clues = self._getclues(wordstream)
-         zsum = 0.0
-         for prob, word, record in clues:
-             if record is not None:  # else wordinfo doesn't know about it
-                 record.killcount += 1
-             zsum += normIP(prob)
- 
-         n = len(clues)
-         if n:
-             # We've added n zscores from a unit normal distribution.  By the
-             # central limit theorem, their mean is normally distributed with
-             # mean 0 and sdev 1/sqrt(n).  So the zscore of zsum/n is
-             # (zsum/n - 0)/(1/sqrt(n)) = zsum/n/(1/sqrt(n)) = zsum/sqrt(n).
-             prob = normP(zsum / sqrt(n))
-         else:
-             prob = 0.5
- 
-         if evidence:
-             clues = [(w, p) for p, w, r in clues]
-             clues.sort(lambda a, b: cmp(a[1], b[1]))
-             clues.insert(0, ('*zsum*', zsum))
-             clues.insert(0, ('*n*', n))
-             clues.insert(0, ('*zscore*', zsum / sqrt(n or 1)))
-             return prob, clues
-         else:
-             return prob
- 
-     if options.use_z_combining:
-         spamprob = z_spamprob
- 
-     def _add_popstats(self, sum, sumsq, n, is_spam):
-         from math import ldexp
- 
-         if is_spam:
-             sum += self.spamsum
-             sumsq += self.spamsumsq
-             n += self.spamn
-             self.spamsum, self.spamsumsq, self.spamn = sum, sumsq, n
-         else:
-             sum += self.hamsum
-             sumsq += self.hamsumsq
-             n += self.hamn
-             self.hamsum, self.hamsumsq, self.hamn = sum, sumsq, n
- 
-         mean = ldexp(sum, -64) / n
-         var = sumsq * n - sum**2
-         var = ldexp(var, -128) / n**2
- 
-         if is_spam:
-             self.spammean, self.spamvar = mean, var
-         else:
-             self.hammean, self.hamvar = mean, var
- 
-     def central_limit_compute_population_stats(self, msgstream, is_spam):
-         from math import ldexp
- 
-         sum = sumsq = 0
-         seen = {}
-         for msg in msgstream:
-             for prob, word, record in self._getclues(msg):
-                 if word in seen:
-                     continue
-                 seen[word] = 1
-                 prob = long(ldexp(prob, 64))
-                 sum += prob
-                 sumsq += prob * prob
- 
-         self._add_popstats(sum, sumsq, len(seen), is_spam)
- 
-     if options.use_central_limit:
-         compute_population_stats = central_limit_compute_population_stats
- 
-     def central_limit_spamprob(self, wordstream, evidence=False):
-         """Return best-guess probability that wordstream is spam.
- 
-         wordstream is an iterable object producing words.
-         The return value is a float in [0.0, 1.0].
- 
-         If optional arg evidence is True, the return value is a pair
-             probability, evidence
-         where evidence is a list of (word, probability) pairs.
-         """
- 
-         from math import sqrt
- 
-         clues = self._getclues(wordstream)
-         sum = 0.0
-         for prob, word, record in clues:
-             sum += prob
-             if record is not None:
-                 record.killcount += 1
-         n = len(clues)
-         if n == 0:
-             return 0.5
-         mean = sum / n
- 
-         # If this sample is drawn from the spam population, its mean is
-         # distributed around spammean with variance spamvar/n.  Likewise
-         # for if it's drawn from the ham population.  Compute a normalized
-         # z-score (how many stddevs is it away from the population mean?)
-         # against both populations, and then it's ham or spam depending
-         # on which population it matches better.
-         zham = (mean - self.hammean) / sqrt(self.hamvar / n)
-         zspam = (mean - self.spammean) / sqrt(self.spamvar / n)
-         delta = abs(zham) - abs(zspam)  # > 0 for spam, < 0 for ham
- 
-         azham, azspam = abs(zham), abs(zspam)
-         if azham < azspam:
-             ratio = azspam / max(azham, 1e-10) # guard against 0 division
-         else:
-             ratio = azham / max(azspam, 1e-10) # guard against 0 division
-         certain = ratio > options.zscore_ratio_cutoff
- 
-         if certain:
-             score = delta > 0.0 and 1.0 or 0.0
-         else:
-             score = delta > 0.0 and 0.51 or 0.49
- 
-         if evidence:
-             clues = [(word, prob) for prob, word, record in clues]
-             clues.sort(lambda a, b: cmp(a[1], b[1]))
-             extra = [('*zham*', zham),
-                      ('*zspam*', zspam),
-                      ('*hmean*', mean),
-                      ('*smean*', mean),
-                      ('*n*', n),
-                     ]
-             clues[0:0] = extra
-             return score, clues
-         else:
-             return score
- 
-     if options.use_central_limit:
-         spamprob = central_limit_spamprob
- 
-     def central_limit_compute_population_stats2(self, msgstream, is_spam):
-         from math import ldexp, log
- 
-         sum = sumsq = 0
-         seen = {}
-         for msg in msgstream:
-             for prob, word, record in self._getclues(msg):
-                 if word in seen:
-                     continue
-                 seen[word] = 1
-                 if is_spam:
-                     prob = log(prob)
-                 else:
-                     prob = log(1.0 - prob)
-                 prob = long(ldexp(prob, 64))
-                 sum += prob
-                 sumsq += prob * prob
- 
-         self._add_popstats(sum, sumsq, len(seen), is_spam)
- 
-     if options.use_central_limit2:
-         compute_population_stats = central_limit_compute_population_stats2
- 
-     def central_limit_spamprob2(self, wordstream, evidence=False):
-         """Return best-guess probability that wordstream is spam.
- 
-         wordstream is an iterable object producing words.
-         The return value is a float in [0.0, 1.0].
- 
-         If optional arg evidence is True, the return value is a pair
-             probability, evidence
-         where evidence is a list of (word, probability) pairs.
-         """
- 
-         from math import sqrt, log
- 
-         clues = self._getclues(wordstream)
-         hsum = ssum = 0.0
-         for prob, word, record in clues:
-             ssum += log(prob)
-             hsum += log(1.0 - prob)
-             if record is not None:
-                 record.killcount += 1
-         n = len(clues)
-         if n == 0:
-             return 0.5
-         hmean = hsum / n
-         smean = ssum / n
- 
-         # If this sample is drawn from the spam population, its mean is
-         # distributed around spammean with variance spamvar/n.  Likewise
-         # for if it's drawn from the ham population.  Compute a normalized
-         # z-score (how many stddevs is it away from the population mean?)
-         # against both populations, and then it's ham or spam depending
-         # on which population it matches better.
-         zham = (hmean - self.hammean) / sqrt(self.hamvar / n)
-         zspam = (smean - self.spammean) / sqrt(self.spamvar / n)
-         delta = abs(zham) - abs(zspam)  # > 0 for spam, < 0 for ham
- 
-         azham, azspam = abs(zham), abs(zspam)
-         if azham < azspam:
-             ratio = azspam / max(azham, 1e-10) # guard against 0 division
-         else:
-             ratio = azham / max(azspam, 1e-10) # guard against 0 division
-         certain = ratio > options.zscore_ratio_cutoff
- 
-         if certain:
-             score = delta > 0.0 and 1.0 or 0.0
-         else:
-             score = delta > 0.0 and 0.51 or 0.49
- 
-         if evidence:
-             clues = [(word, prob) for prob, word, record in clues]
-             clues.sort(lambda a, b: cmp(a[1], b[1]))
-             extra = [('*zham*', zham),
-                      ('*zspam*', zspam),
-                      ('*hmean*', hmean),
-                      ('*smean*', smean),
-                      ('*n*', n),
-                     ]
-             clues[0:0] = extra
-             return score, clues
-         else:
-             return score
- 
-     if options.use_central_limit2 or options.use_central_limit3:
-         spamprob = central_limit_spamprob2
- 
-     def central_limit_compute_population_stats3(self, msgstream, is_spam):
-         from math import ldexp, log
- 
-         sum = sumsq = n = 0
-         for msg in msgstream:
-             n += 1
-             probsum = 0.0
-             clues = self._getclues(msg)
-             for prob, word, record in clues:
-                 if is_spam:
-                     probsum += log(prob)
-                 else:
-                     probsum += log(1.0 - prob)
-             mean = long(ldexp(probsum / len(clues), 64))
-             sum += mean
-             sumsq += mean * mean
- 
-         self._add_popstats(sum, sumsq, n, is_spam)
- 
-     if options.use_central_limit3:
-         compute_population_stats = central_limit_compute_population_stats3
--- 517,518 ----

--- clgen.py DELETED ---

--- clpik.py DELETED ---

--- rmspik.py DELETED ---