[Spambayes-checkins] spambayes Options.py,1.34,1.35 classifier.py,1.21,1.22

Tim Peters tim_one@users.sourceforge.net
Fri, 27 Sep 2002 15:29:58 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv29156

Modified Files:
	Options.py classifier.py 
Log Message:
Gary's "f(w)" scheme is now the default, and code unique to the
Graham scheme has gone away (but was tagged with Last-Graham).

These options have vanished:
    hambias
    spambias
    min_spamprob
    max_spamprob
    unknown_word_spamprob
    use_robinson_combining
    use_robinson_probability
    use_robinson_ranking

These options have changed default value:
    robinson_probability_a: 0.225 (was 1.0)
    robinson_minimum_prob_strength: 0.1 (was 0.0)
    max_discriminators: 150 (was 16)
    spam_cutoff: 0.570 (was 0.90) # THIS IS CORPUS-DEPENDENT!

In addition, I did a little long-overdue refactoring of the classifier
internals.  The visible interface hasn't changed.


Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.34
retrieving revision 1.35
diff -C2 -d -r1.34 -r1.35
*** Options.py	27 Sep 2002 04:02:59 -0000	1.34
--- Options.py	27 Sep 2002 22:29:56 -0000	1.35
***************
*** 100,110 ****
  
  # A message is considered spam iff it scores greater than spam_cutoff.
! # If using Graham's combining scheme, 0.90 seems to work best for "small"
! # training sets.  As the size of the training sets increase, there's not
! # yet any bound in sight for how low this can go (0.075 would work as
! # well as 0.90 on Tim's large c.l.py data).
! # For Gary Robinson's scheme, some value between 0.50 and 0.60 has worked
! # best in all reports so far.
! spam_cutoff: 0.90
  
  # Number of buckets in histograms.
--- 100,106 ----
  
  # A message is considered spam iff it scores greater than spam_cutoff.
! # This is corpus-dependent, and values into the .600's have been known
! # to work best on some data.
! spam_cutoff: 0.570
  
  # Number of buckets in histograms.
***************
*** 174,219 ****
  
  [Classifier]
! # Fiddling these can have extreme effects.  See classifier.py for comments.
! hambias: 2.0
! spambias: 1.0
! 
! min_spamprob: 0.01
! max_spamprob: 0.99
! unknown_spamprob: 0.5
! 
! max_discriminators: 16
! 
! ###########################################################################
! # Speculative options for Gary Robinson's ideas.  These may go away, or
! # a bunch of incompatible stuff above may go away.
! 
! # Use Gary's scheme for combining probabilities.
! use_robinson_combining: False
  
! # Use Gary's scheme for computing probabilities, along with its "a" and
! # "x" parameters.
! use_robinson_probability: False
! robinson_probability_a: 1.0
  robinson_probability_x: 0.5
  
- # Use Gary's scheme for ranking probabilities.
- use_robinson_ranking: False
- 
  # When scoring a message, ignore all words with
  # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
! # By default (0.0), nothing is ignored.
! # Tim got a pretty clear improvement in f-n rate on his hasn't-improved-in-
! # a-long-time large c.l.py test by using 0.1.  No other values have been
! # tried yet.
! # Neil Schemenauer also reported good results from 0.1, making the all-
! # Robinson scheme match the all-default Graham-like scheme on a smaller
! # and different corpus.
! # NOTE:  Changing this may change the best spam_cutoff value for your
! # corpus.  Since one effect is to separate the means more, you'll probably
! # want a higher spam_cutoff.
! robinson_minimum_prob_strength: 0.0
  
  ###########################################################################
! # More speculative options for Gary Robinson's central-limit.  These may go
  # away, or a bunch of incompatible stuff above may go away.
  
--- 170,204 ----
  
  [Classifier]
! # The maximum number of extreme words to look at in a msg, where "extreme"
! # means with spamprob farthest away from 0.5.  150 appears to work well
! # across all corpora tested.
! max_discriminators: 150
  
! # These two control the prior assumption about word probabilities.
! # "x" is essentially the probability given to a word that's never been
! # seen before.  Nobody has reported an improvement via moving it away
! # from 1/2.
! # "a" adjusts how much weight to give the prior assumption relative to
! # the probabilities estimated by counting.  At a=0, the counting estimates
! # are believed 100%, even to the extent of assigning certainty (0 or 1)
! # to a word that's appeared in only ham or only spam.  This is a disaster.
! # As "a" tends toward infintity, all probabilities tend toward "x".  All
! # reports were that a value near 0.2 worked best, so this doesn't seem to
! # be corpus-dependent.
! # XXX Gary Robinson has since renamed "a" to "s", and redone his formulas
! # XXX to make it a measure of belief strength rather than "a number" from
! # XXX 0 to infinity.  We haven't caught up to that yet.
! robinson_probability_a: 0.225
  robinson_probability_x: 0.5
  
  # When scoring a message, ignore all words with
  # abs(word.spamprob - 0.5) < robinson_minimum_prob_strength.
! # This may be a hack, but it has proved to reduce error rates in many
! # tests over Robinson's base scheme.  0.1 appeared to work well across
! # all corpora.
! robinson_minimum_prob_strength: 0.1
  
  ###########################################################################
! # Speculative options for Gary Robinson's central-limit ideas.  These may go
  # away, or a bunch of incompatible stuff above may go away.
  
***************
*** 268,282 ****
                     'best_cutoff_fp_weight': float_cracker,
                    },
!     'Classifier': {'hambias': float_cracker,
!                    'spambias': float_cracker,
!                    'min_spamprob': float_cracker,
!                    'max_spamprob': float_cracker,
!                    'unknown_spamprob': float_cracker,
!                    'max_discriminators': int_cracker,
!                    'use_robinson_combining': boolean_cracker,
!                    'use_robinson_probability': boolean_cracker,
                     'robinson_probability_a': float_cracker,
                     'robinson_probability_x': float_cracker,
-                    'use_robinson_ranking': boolean_cracker,
                     'robinson_minimum_prob_strength': float_cracker,
  
--- 253,259 ----
                     'best_cutoff_fp_weight': float_cracker,
                    },
!     'Classifier': {'max_discriminators': int_cracker,
                     'robinson_probability_a': float_cracker,
                     'robinson_probability_x': float_cracker,
                     'robinson_minimum_prob_strength': float_cracker,
  

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.21
retrieving revision 1.22
diff -C2 -d -r1.21 -r1.22
*** classifier.py	27 Sep 2002 21:18:18 -0000	1.21
--- classifier.py	27 Sep 2002 22:29:56 -0000	1.22
***************
*** 1,178 ****
! # This is an implementation of the Bayes-like spam classifier sketched
! # by Paul Graham at <http://www.paulgraham.com/spam.html>.  We say
! # "Bayes-like" because there are many ad hoc deviations from a
! # "normal" Bayesian classifier.
! #
! # This implementation is due to Tim Peters et alia.
! 
! import time
! from heapq import heapreplace
! from sets import Set
! 
! from Options import options
! 
! # The count of each word in ham is artificially boosted by a factor of
! # HAMBIAS, and similarly for SPAMBIAS.  Graham uses 2.0 and 1.0.  Final
! # results are very sensitive to the HAMBIAS value.  On my 5x5 c.l.py
! # test grid with 20,000 hams and 13,750 spams split into 5 pairs, then
! # across all 20 test runs (for each pair, training on that pair then scoring
! # against the other 4 pairs), and counting up all the unique msgs ever
! # identified as false negative or positive, then compared to HAMBIAS 2.0,
! #
! # At HAMBIAS 1.0
! #    total unique false positives goes up   by a factor of 7.6 ( 23 -> 174)
! #    total unique false negatives goes down by a factor of 2   (337 -> 166)
! #
! # At HAMBIAS 3.0
! #    total unique false positives goes down by a factor of 4.6 ( 23 ->   5)
! #    total unique false negatives goes up   by a factor of 2.1 (337 -> 702)
! 
! HAMBIAS  = options.hambias  # 2.0
! SPAMBIAS = options.spambias # 1.0
! 
! # "And then there is the question of what probability to assign to words
! # that occur in one corpus but not the other. Again by trial and error I
! # chose .01 and .99.".  However, the code snippet clamps *all* probabilities
! # into this range.  That's good in principle (IMO), because no finite amount
! # of training data is good enough to justify probabilities of 0 or 1.  It
! # may justify probabilities outside this range, though.
! MIN_SPAMPROB = options.min_spamprob # 0.01
! MAX_SPAMPROB = options.max_spamprob # 0.99
! 
! # The spam probability assigned to words never seen before.  Graham used
! # 0.2 here.  Neil Schemenauer reported that 0.5 seemed to work better.  In
! # Tim's content-only tests (no headers), boosting to 0.5 cut the false
! # negative rate by over 1/3.  The f-p rate increased, but there were so few
! # f-ps that the increase wasn't statistically significant.  It also caught
! # 13 more spams erroneously classified as ham.  By eyeball (and common
! # sense <wink>), this has most effect on very short messages, where there
! # simply aren't many high-value words.  A word with prob 0.5 is (in effect)
! # completely ignored by spamprob(), in favor of *any* word with *any* prob
! # differing from 0.5.  At 0.2, an unknown word favors ham at the expense
! # of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
! # on the face of it.
! UNKNOWN_SPAMPROB = options.unknown_spamprob # 0.5
! 
! # "I only consider words that occur more than five times in total".
! # But the code snippet considers words that appear at least five times.
! # This implementation follows the code rather than the explanation.
! # (In addition, the count compared is after multiplying it with the
! # appropriate bias factor.)
! #
! # Twist:  Graham used MINCOUNT=5.0 here.  I got rid of it:  in effect,
! # given HAMBIAS=2.0, it meant we ignored a possibly perfectly good piece
! # of spam evidence unless it appeared at least 5 times, and ditto for
! # ham evidence unless it appeared at least 3 times.  That certainly does
! # bias in favor of ham, but multiple distortions in favor of ham are
! # multiple ways to get confused and trip up.  Here are the test results
! # before and after, MINCOUNT=5.0 on the left, no MINCOUNT on the right;
! # ham sets had 4000 msgs (so 0.025% is one msg), and spam sets 2750:
! #
! # false positive percentages
! #     0.000  0.000  tied
! #     0.000  0.000  tied
! #     0.100  0.050  won    -50.00%
! #     0.000  0.025  lost  +(was 0)
! #     0.025  0.075  lost  +200.00%
! #     0.025  0.000  won   -100.00%
! #     0.100  0.100  tied
! #     0.025  0.050  lost  +100.00%
! #     0.025  0.025  tied
! #     0.050  0.025  won    -50.00%
! #     0.100  0.050  won    -50.00%
! #     0.025  0.050  lost  +100.00%
! #     0.025  0.050  lost  +100.00%
! #     0.025  0.000  won   -100.00%
! #     0.025  0.000  won   -100.00%
! #     0.025  0.075  lost  +200.00%
! #     0.025  0.025  tied
! #     0.000  0.000  tied
! #     0.025  0.025  tied
! #     0.100  0.050  won    -50.00%
  #
! # won   7 times
! # tied  7 times
! # lost  6 times
  #
! # total unique fp went from 9 to 13
  #
! # false negative percentages
! #     0.364  0.327  won    -10.16%
! #     0.400  0.400  tied
! #     0.400  0.327  won    -18.25%
! #     0.909  0.691  won    -23.98%
! #     0.836  0.545  won    -34.81%
! #     0.618  0.291  won    -52.91%
! #     0.291  0.218  won    -25.09%
! #     1.018  0.654  won    -35.76%
! #     0.982  0.364  won    -62.93%
! #     0.727  0.291  won    -59.97%
! #     0.800  0.327  won    -59.13%
! #     1.163  0.691  won    -40.58%
! #     0.764  0.582  won    -23.82%
! #     0.473  0.291  won    -38.48%
! #     0.473  0.364  won    -23.04%
! #     0.727  0.436  won    -40.03%
! #     0.655  0.436  won    -33.44%
! #     0.509  0.218  won    -57.17%
! #     0.545  0.291  won    -46.61%
! #     0.509  0.254  won    -50.10%
  #
! # won  19 times
! # tied  1 times
! # lost  0 times
  #
! # total unique fn went from 168 to 106
  #
! # So dropping MINCOUNT was a huge win for the f-n rate, and a mixed bag
! # for the f-p rate (but the f-p rate was so low compared to 4000 msgs that
! # even the losses were barely significant).  In addition, dropping MINCOUNT
! # had a larger good effect when using random training subsets of size 500;
! # this makes intuitive sense, as with less training data it was harder to
! # exceed the MINCOUNT threshold.
  #
! # Still, MINCOUNT seemed to be a gross approximation to *something* valuable:
! # a strong clue appearing in 1,000 training msgs is certainly more trustworthy
! # than an equally strong clue appearing in only 1 msg.  I'm almost certain it
! # would pay to develop a way to take that into account when scoring.  In
! # particular, there was a very specific new class of false positives
! # introduced by dropping MINCOUNT:  some c.l.py msgs consisting mostly of
! # Spanish or French.  The "high probability" spam clues were innocuous
! # words like "puedo" and "como", that appeared in very rare Spanish and
! # French spam too.  There has to be a more principled way to address this
! # than the MINCOUNT hammer, and the test results clearly showed that MINCOUNT
! # did more harm than good overall.
  
  
! # The maximum number of words spamprob() pays attention to.  Graham had 15
! # here.  If there are 8 indicators with spam probabilities near 1, and 7
! # near 0, the math is such that the combined result is near 1.  Making this
! # even gets away from that oddity (8 of each allows for graceful ties,
! # which favor ham).
! #
! # XXX That should be revisited.  Stripping HTML tags from plain text msgs
! # XXX later addressed some of the same problem cases.  The best value for
! # XXX MAX_DISCRIMINATORS remains unknown, but increasing it a lot is known
! # XXX to hurt.
! # XXX Later:  tests after cutting this back to 15 showed no effect on the
! # XXX f-p rate, and a tiny shift in the f-n rate (won 3 times, tied 8 times,
! # XXX lost 9 times).  There isn't a significant difference, so leaving it
! # XXX at 16.
! #
! # A twist:  When staring at failures, it wasn't unusual to see the top
! # discriminators *all* have values of MIN_SPAMPROB and MAX_SPAMPROB.  The
! # math is such that one MIN_SPAMPROB exactly cancels out one MAX_SPAMPROB,
! # yielding no info at all.  Then whichever flavor of clue happened to reach
! # MAX_DISCRIMINATORS//2 + 1 occurrences first determined the final outcome,
! # based on almost no real evidence.
! #
! # So spamprob() was changed to save lists of *all* MIN_SPAMPROB and
! # MAX_SPAMPROB clues.  If the number of those are equal, they're all ignored.
! # Else the flavor with the smaller number of instances "cancels out" the
! # same number of instances of the other flavor, and the remaining instances
! # of the other flavor are fed into the probability computation.  This change
! # was a pure win, lowering the false negative rate consistently, and it even
! # managed to tickle a couple rare false positives into "not spam" terrority.
! MAX_DISCRIMINATORS = options.max_discriminators # 16
  
  PICKLE_VERSION = 1
--- 1,36 ----
! # An implementation of a Bayes-like spam classifier.
  #
! # Paul Graham's original description:
  #
! #     http://www.paulgraham.com/spam.html
  #
! # A highly fiddled version of that can be retrieved from our CVS repository,
! # via tag Last-Graham.  This made many demonstrated improvements in error
! # rates over Paul's original description.
  #
! # This code implements Gary Robinson's suggestions, which are well explained
! # on his webpage:
  #
! #    http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
  #
! # This is theoretically cleaner, and in testing has performed at least as
! # well as our highly tuned Graham scheme did, often slightly better, and
! # sometimes much better.  It also has "a middle ground", which people like:
! # the scores under Paul's scheme were almost always very near 0 or very near
! # 1, whether or not the classification was correct.  The false positives
! # and false negatives under Gary's scheme generally score in a narrow range
! # around the corpus's best spam_cutoff value
  #
! # This implementation is due to Tim Peters et alia.
  
+ import time
+ from heapq import heapreplace
+ from sets import Set
  
! from Options import options
! 
! # The maximum number of extreme words to look at in a msg, where "extreme"
! # means with spamprob farthest away from 0.5.
! MAX_DISCRIMINATORS = options.max_discriminators # 150
  
  PICKLE_VERSION = 1
***************
*** 273,359 ****
          """
  
!         # A priority queue to remember the MAX_DISCRIMINATORS best
!         # probabilities, where "best" means largest distance from 0.5.
!         # The tuples are (distance, prob, word, wordinfo[word]).
!         nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS
!         smallest_best = -1.0
! 
!         wordinfoget = self.wordinfo.get
!         now = time.time()
!         mins = []   # all words w/ prob MIN_SPAMPROB
!         maxs = []   # all words w/ prob MAX_SPAMPROB
!         # Counting a unique word multiple times hurts, although counting one
!         # at most two times had some benefit whan UNKNOWN_SPAMPROB was 0.2.
!         # When that got boosted to 0.5, counting more than once became
!         # counterproductive.
!         for word in Set(wordstream):
!             record = wordinfoget(word)
!             if record is None:
!                 prob = UNKNOWN_SPAMPROB
!             else:
!                 record.atime = now
!                 prob = record.spamprob
! 
!             distance = abs(prob - 0.5)
!             if prob == MIN_SPAMPROB:
!                 mins.append((distance, prob, word, record))
!             elif prob == MAX_SPAMPROB:
!                 maxs.append((distance, prob, word, record))
!             elif distance > smallest_best:
!                 # Subtle:  we didn't use ">" instead of ">=" just to save
!                 # calls to heapreplace().  The real intent is that if
!                 # there are many equally strong indicators throughout the
!                 # message, we want to favor the ones that appear earliest:
!                 # it's expected that spam headers will often have smoking
!                 # guns, and, even when not, spam has to grab your attention
!                 # early (& note that when spammers generate large blocks of
!                 # random gibberish to throw off exact-match filters, it's
!                 # always at the end of the msg -- if they put it at the
!                 # start, *nobody* would read the msg).
!                 heapreplace(nbest, (distance, prob, word, record))
!                 smallest_best = nbest[0][0]
! 
!         # Compute the probability.  Note:  This is what Graham's code did,
!         # but it's dubious for reasons explained in great detail on Python-
!         # Dev:  it's missing P(spam) and P(not-spam) adjustments that
!         # straightforward Bayesian analysis says should be here.  It's
!         # unclear how much it matters, though, as the omissions here seem
!         # to tend in part to cancel out distortions introduced earlier by
!         # HAMBIAS.  Experiments will decide the issue.
!         clues = []
  
!         # First cancel out competing extreme clues (see comment block at
!         # MAX_DISCRIMINATORS declaration -- this is a twist on Graham).
!         if mins or maxs:
!             if len(mins) < len(maxs):
!                 shorter, longer = mins, maxs
!             else:
!                 shorter, longer = maxs, mins
!             tokeep = min(len(longer) - len(shorter), MAX_DISCRIMINATORS)
!             # They're all good clues, but we're only going to feed the tokeep
!             # initial clues from the longer list into the probability
!             # computation.
!             for dist, prob, word, record in shorter + longer[tokeep:]:
!                 record.killcount += 1
!                 if evidence:
!                     clues.append((word, prob))
!             for x in longer[:tokeep]:
!                 heapreplace(nbest, x)
  
!         prob_product = inverse_prob_product = 1.0
!         for distance, prob, word, record in nbest:
!             if prob is None:    # it's one of the dummies nbest started with
!                 continue
              if record is not None:  # else wordinfo doesn't know about it
                  record.killcount += 1
!             if evidence:
!                 clues.append((word, prob))
!             prob_product *= prob
!             inverse_prob_product *= 1.0 - prob
  
!         prob = prob_product / (prob_product + inverse_prob_product)
  
          if evidence:
!             clues.sort(lambda a, b: cmp(a[1], b[1]))
              return prob, clues
          else:
--- 131,184 ----
          """
  
!         from math import frexp
  
!         # This combination method is due to Gary Robinson; see
!         # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
  
!         # The real P = this P times 2**Pexp.  Likewise for Q.  We're
!         # simulating unbounded dynamic float range by hand.  If this pans
!         # out, *maybe* we should store logarithms in the database instead
!         # and just add them here.  But I like keeping raw counts in the
!         # database (they're easy to understand, manipulate and combine),
!         # and there's no evidence that this simulation is a significant
!         # expense.
!         P = Q = 1.0
!         Pexp = Qexp = 0
!         clues = self._getclues(wordstream)
!         for prob, word, record in clues:
              if record is not None:  # else wordinfo doesn't know about it
                  record.killcount += 1
!             P *= 1.0 - prob
!             Q *= prob
!             if P < 1e-200:  # move back into range
!                 P, e = frexp(P)
!                 Pexp += e
!             if Q < 1e-200:  # move back into range
!                 Q, e = frexp(Q)
!                 Qexp += e
  
!         P, e = frexp(P)
!         Pexp += e
!         Q, e = frexp(Q)
!         Qexp += e
! 
!         num_clues = len(clues)
!         if num_clues:
!             #P = 1.0 - P**(1./num_clues)
!             #Q = 1.0 - Q**(1./num_clues)
!             #
!             # (x*2**e)**n = x**n * 2**(e*n)
!             n = 1.0 / num_clues
!             P = 1.0 - P**n * 2.0**(Pexp * n)
!             Q = 1.0 - Q**n * 2.0**(Qexp * n)
! 
!             prob = (P-Q)/(P+Q)  # in -1 .. 1
!             prob = 0.5 + prob/2 # shift to 0 .. 1
!         else:
!             prob = 0.5
  
          if evidence:
!             clues.sort()
!             clues = [(w, p) for p, w, r in clues]
              return prob, clues
          else:
***************
*** 403,418 ****
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
!         for word,record in self.wordinfo.iteritems():
              # Compute prob(msg is spam | msg contains word).
!             hamcount = min(HAMBIAS * record.hamcount, nham)
!             spamcount = min(SPAMBIAS * record.spamcount, nspam)
              hamratio = hamcount / nham
              spamratio = spamcount / nspam
  
              prob = spamratio / (hamratio + spamratio)
!             if prob < MIN_SPAMPROB:
!                 prob = MIN_SPAMPROB
!             elif prob > MAX_SPAMPROB:
!                 prob = MAX_SPAMPROB
  
              if record.spamprob != prob:
--- 228,257 ----
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
!         A = options.robinson_probability_a
!         X = options.robinson_probability_x
!         AoverX = A/X
!         for word, record in self.wordinfo.iteritems():
              # Compute prob(msg is spam | msg contains word).
!             # This is the Graham calculation, but stripped of biases, and
!             # stripped of clamping into 0.01 thru 0.99.  The Bayesian
!             # adjustment following keeps them in a sane range, and one
!             # that naturally grows the more evidence there is to back up
!             # a probability.
!             hamcount = min(record.hamcount, nham)
              hamratio = hamcount / nham
+ 
+             spamcount = min(record.spamcount, nspam)
              spamratio = spamcount / nspam
  
              prob = spamratio / (hamratio + spamratio)
! 
!             # Now do Robinson's Bayesian adjustment.
!             #
!             #         a + (n * p(w))
!             # f(w) = ---------------
!             #          (a / x) + n
! 
!             n = hamcount + spamcount
!             prob = (A + n * prob) / (AoverX + n)
  
              if record.spamprob != prob:
***************
*** 481,487 ****
          pass
  
-     # XXX More stuff should be reworked to use this as a helper function.
      def _getclues(self, wordstream):
          mindist = options.robinson_minimum_prob_strength
  
          # A priority queue to remember the MAX_DISCRIMINATORS best
--- 320,326 ----
          pass
  
      def _getclues(self, wordstream):
          mindist = options.robinson_minimum_prob_strength
+         unknown = options.robinson_probability_x
  
          # A priority queue to remember the MAX_DISCRIMINATORS best
***************
*** 496,504 ****
              record = wordinfoget(word)
              if record is None:
!                 prob = UNKNOWN_SPAMPROB
              else:
                  record.atime = now
                  prob = record.spamprob
- 
              distance = abs(prob - 0.5)
              if distance >= mindist and distance > smallest_best:
--- 335,342 ----
              record = wordinfoget(word)
              if record is None:
!                 prob = unknown
              else:
                  record.atime = now
                  prob = record.spamprob
              distance = abs(prob - 0.5)
              if distance >= mindist and distance > smallest_best:
***************
*** 506,513 ****
                  smallest_best = nbest[0][0]
  
!         clues = [(prob, word, record)
!                     for distance, prob, word, record in nbest
!                     if prob is not None]
!         return clues
  
      #************************************************************************
--- 344,349 ----
                  smallest_best = nbest[0][0]
  
!         # Return (prob, word, record) for the non-dummies.
!         return [t[1:] for t in nbest if t[1] is not None]
  
      #************************************************************************
***************
*** 518,664 ****
      # to only one of the alternatives surviving.
  
-     def robinson_spamprob(self, wordstream, evidence=False):
-         """Return best-guess probability that wordstream is spam.
- 
-         wordstream is an iterable object producing words.
-         The return value is a float in [0.0, 1.0].
- 
-         If optional arg evidence is True, the return value is a pair
-             probability, evidence
-         where evidence is a list of (word, probability) pairs.
-         """
- 
-         from math import frexp
-         mindist = options.robinson_minimum_prob_strength
- 
-         # A priority queue to remember the MAX_DISCRIMINATORS best
-         # probabilities, where "best" means largest distance from 0.5.
-         # The tuples are (distance, prob, word, wordinfo[word]).
-         nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS
-         smallest_best = -1.0
- 
-         wordinfoget = self.wordinfo.get
-         now = time.time()
-         for word in Set(wordstream):
-             record = wordinfoget(word)
-             if record is None:
-                 prob = UNKNOWN_SPAMPROB
-             else:
-                 record.atime = now
-                 prob = record.spamprob
- 
-             distance = abs(prob - 0.5)
-             if distance >= mindist and distance > smallest_best:
-                 heapreplace(nbest, (distance, prob, word, record))
-                 smallest_best = nbest[0][0]
- 
-         # Compute the probability.
-         clues = []
- 
-         # This combination method is due to Gary Robinson.
-         # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
-         # In preliminary tests, it did just as well as Graham's scheme,
-         # but creates a definite "middle ground" around 0.5 where false
-         # negatives and false positives can actually found in non-trivial
-         # number.
- 
-         # The real P = this P times 2**Pexp.  Likewise for Q.  We're
-         # simulating unbounded dynamic float range by hand.  If this pans
-         # out, *maybe* we should store logarithms in the database instead
-         # and just add them here.
-         P = Q = 1.0
-         Pexp = Qexp = 0
-         num_clues = 0
-         for distance, prob, word, record in nbest:
-             if prob is None:    # it's one of the dummies nbest started with
-                 continue
-             if record is not None:  # else wordinfo doesn't know about it
-                 record.killcount += 1
-             if evidence:
-                 clues.append((word, prob))
-             num_clues += 1
-             P *= 1.0 - prob
-             Q *= prob
-             if P < 1e-200:  # move back into range
-                 P, e = frexp(P)
-                 Pexp += e
-             if Q < 1e-200:  # move back into range
-                 Q, e = frexp(Q)
-                 Qexp += e
- 
-         P, e = frexp(P)
-         Pexp += e
-         Q, e = frexp(Q)
-         Qexp += e
- 
-         if num_clues:
-             #P = 1.0 - P**(1./num_clues)
-             #Q = 1.0 - Q**(1./num_clues)
-             #
-             # (x*2**e)**n = x**n * 2**(e*n)
-             n = 1.0 / num_clues
-             P = 1.0 - P**n * 2.0**(Pexp * n)
-             Q = 1.0 - Q**n * 2.0**(Qexp * n)
- 
-             prob = (P-Q)/(P+Q)  # in -1 .. 1
-             prob = 0.5 + prob/2 # shift to 0 .. 1
-         else:
-             prob = 0.5
- 
-         if evidence:
-             clues.sort(lambda a, b: cmp(a[1], b[1]))
-             return prob, clues
-         else:
-             return prob
- 
-     if options.use_robinson_combining:
-         spamprob = robinson_spamprob
- 
-     def robinson_update_probabilities(self):
-         """Update the word probabilities in the spam database.
- 
-         This computes a new probability for every word in the database,
-         so can be expensive.  learn() and unlearn() update the probabilities
-         each time by default.  Thay have an optional argument that allows
-         to skip this step when feeding in many messages, and in that case
-         you should call update_probabilities() after feeding the last
-         message and before calling spamprob().
-         """
- 
-         nham = float(self.nham or 1)
-         nspam = float(self.nspam or 1)
-         A = options.robinson_probability_a
-         X = options.robinson_probability_x
-         AoverX = A/X
-         for word, record in self.wordinfo.iteritems():
-             # Compute prob(msg is spam | msg contains word).
-             # This is the Graham calculation, but stripped of biases, and
-             # of clamping into 0.01 thru 0.99.
-             hamcount = min(record.hamcount, nham)
-             hamratio = hamcount / nham
- 
-             spamcount = min(record.spamcount, nspam)
-             spamratio = spamcount / nspam
- 
-             prob = spamratio / (hamratio + spamratio)
- 
-             # Now do Robinson's Bayesian adjustment.
-             #
-             #         a + (n * p(w))
-             # f(w) = ---------------
-             #          (a / x) + n
- 
-             n = hamcount + spamcount
-             prob = (A + n * prob) / (AoverX + n)
- 
-             if record.spamprob != prob:
-                 record.spamprob = prob
-                 # The next seemingly pointless line appears to be a hack
-                 # to allow a persistent db to realize the record has changed.
-                 self.wordinfo[word] = record
- 
-     if options.use_robinson_probability:
-         update_probabilities = robinson_update_probabilities
- 
      def central_limit_compute_population_stats(self, msgstream, is_spam):
          from math import ldexp
--- 354,357 ----
***************
*** 745,751 ****
      if options.use_central_limit:
          spamprob = central_limit_spamprob
- 
- 
- 
  
      def central_limit_compute_population_stats2(self, msgstream, is_spam):
--- 438,441 ----