[Python-checkins] python/nondist/sandbox/spambayes classifier.py,1.7,1.8

tim_one@users.sourceforge.net
Mon, 02 Sep 2002 13:57:48 -0700


Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7837

Modified Files:
	classifier.py 
Log Message:
Added a new xspamprob() method, which computes the combined probability
"correctly", and a long comment block explaining what happened when I
tried it.  There's something worth pursuing here (it greatly improves
the false negative rate), but this change alone pushes too many marginal
hams into the spam camp.
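
For reference, the corrected combination just weights each clue product by the
class prior.  A minimal standalone sketch (clue probabilities and corpus sizes
here are illustrative, not from the checkin), with the uncorrected spamprob()
form alongside for contrast:

    def combined(probs, nham, nspam):
        # xspamprob()'s form:  start the products at the class priors,
        # and normalize each clue by them.
        sp = float(nspam) / (nham + nspam)   # P(spam)
        hp = 1.0 - sp                        # P(ham)
        num, den = sp, hp
        for p in probs:
            num *= p / sp
            den *= (1.0 - p) / hp
        return num / (num + den)

    def combined_uncorrected(probs):
        # spamprob()'s form:  the same thing with sp = hp = 0.5 assumed.
        num = den = 1.0
        for p in probs:
            num *= p
            den *= 1.0 - p
        return num / (num + den)

    print combined_uncorrected([0.99, 0.01])   # 0.5:  opposed clues cancel
    print combined([0.99, 0.01], 4000, 2750)   # ~0.593:  the prior tips it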


Index: classifier.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/classifier.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** classifier.py	1 Sep 2002 07:22:27 -0000	1.7
--- classifier.py	2 Sep 2002 20:57:46 -0000	1.8
***************
*** 206,209 ****
--- 206,384 ----
              return prob
  
+     # The same as spamprob(), except it uses a corrected probability
+     # computation accounting for P(spam) and P(not-spam).  Since my training
+     # corpora had a ham/spam ratio of 4000/2750, I'm in a good position to
+     # test this.
+     # Using xspamprob() clearly made a major reduction in the false negative
+     # rate, cutting it in half on some runs (this is after the f-n rate had
+     # already been cut by a factor of 5 via other refinements).  It also
+     # uncovered two more very brief spams hiding in the ham corpora.
+     #
+     # OTOH, the # of fps increased.  Especially vulnerable are extremely
+     # short msgs of the "subscribe me"/"unsubscribe me" variety (while these
+     # don't belong on a mailing list, they're not spam), and brief reasonable
+     # msgs that simply don't have much evidence (to the human eye) to go on.
+     # These were borderline before, and it's easy to push them over the edge.
+     # For example, one f-p had subject
+     #
+     #     Any Interest in EDIFACT Parser/Generator?
+     #
+     # and the body just
+     #
+     #     Just curious.
+     #     --jim
+     #
+     # "Interest" in the subject line had spam prob 0.99, "curious." 0.01,
+     # and nothing else was strong.  Since my ham/spam ratio is bigger than
+     # 1, any clue favoring spam favors spam more strongly under xspamprob()
+     # than under spamprob().
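+     #
+     # (A worked check:  with sp = 2750/6750 ~= 0.407 and hp ~= 0.593, the
+     # two opposed clues above, 0.99 and 0.01, no longer cancel to 0.5 --
+     # the combined prob is (0.0099/sp) / (0.0099/sp + 0.0099/hp), which
+     # is just hp ~= 0.59.  The prior alone tips an evenly balanced msg
+     # toward spam.)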
+     #
+     # XXX Somewhat like spamprob(), learn() also computes probabilities as
+     # XXX if the # of hams and spams were the same.  If that were also
+     # XXX fiddled to take nham and nspam into account (nb:  I realize it
+     # XXX already *looks* like it does -- but it doesn't), it would reduce
+     # XXX the spam probabilities in my test run, and *perhaps* xspamprob
+     # XXX wouldn't have such a bad effect on the f-p story.
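+     # XXX (Concretely, such a fiddle might compute something like
+     # XXX     prob = (spamcount/nspam) / (spamcount/nspam + hamcount/nham)
+     # XXX rather than treating raw counts from unequal-size corpora as
+     # XXX directly comparable.)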
+     #
+     # Here are the comparative stats, with spamprob() in the left column and
+     # xspamprob() in the right, across 20 runs:
+     #
+     #    false positive percentages
+     #        0.000  0.000  tied
+     #        0.000  0.050  lost
+     #        0.050  0.100  lost
+     #        0.000  0.075  lost
+     #        0.025  0.050  lost
+     #        0.025  0.100  lost
+     #        0.050  0.150  lost
+     #        0.025  0.050  lost
+     #        0.025  0.050  lost
+     #        0.000  0.050  lost
+     #        0.075  0.150  lost
+     #        0.050  0.075  lost
+     #        0.025  0.050  lost
+     #        0.000  0.050  lost
+     #        0.050  0.125  lost
+     #        0.025  0.075  lost
+     #        0.025  0.025  tied
+     #        0.000  0.025  lost
+     #        0.025  0.100  lost
+     #        0.050  0.150  lost
+     #
+     #    won   0 times
+     #    tied  2 times
+     #    lost 18 times
+     #
+     #    total unique fp went from 8 to 30
+     #
+     #    false negative percentages
+     #        0.945  0.473  won
+     #        0.836  0.582  won
+     #        1.200  0.618  won
+     #        1.418  0.836  won
+     #        1.455  0.836  won
+     #        1.091  0.691  won
+     #        1.091  0.618  won
+     #        1.236  0.691  won
+     #        1.564  1.018  won
+     #        1.236  0.618  won
+     #        1.563  0.981  won
+     #        1.563  0.800  won
+     #        1.236  0.618  won
+     #        0.836  0.400  won
+     #        0.873  0.400  won
+     #        1.236  0.545  won
+     #        1.273  0.691  won
+     #        1.018  0.327  won
+     #        1.091  0.473  won
+     #        1.490  0.618  won
+     #
+     #    won  20 times
+     #    tied  0 times
+     #    lost  0 times
+     #
+     #    total unique fn went from 292 to 162
+ 
+     def xspamprob(self, wordstream, evidence=False):
+         """Return best-guess probability that wordstream is spam.
+ 
+         wordstream is an iterable object producing words.
+         The return value is a float in [0.0, 1.0].
+ 
+         If optional arg evidence is True, the return value is a pair
+             probability, evidence
+         where evidence is a list of (word, probability) pairs.
+         """
+ 
+         if self.DEBUG:
+             print "spamprob(%r)" % wordstream
+ 
+         # A priority queue to remember the MAX_DISCRIMINATORS best
+         # probabilities, where "best" means largest distance from 0.5.
+         # The tuples are (distance, prob, word, wordinfo[word]).
+         nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS
+         smallest_best = -1.0
+ 
+         # Counting a unique word multiple times hurts, although counting a
+         # word at most twice had some benefit when UNKNOWN_SPAMPROB was 0.2.
+         # When that got boosted to 0.5, counting more than once became
+         # counterproductive.
+         unique_words = {}
+ 
+         wordinfoget = self.wordinfo.get
+         now = time.time()
+ 
+         for word in wordstream:
+             if word in unique_words:
+                 continue
+             unique_words[word] = 1
+ 
+             record = wordinfoget(word)
+             if record is None:
+                 prob = UNKNOWN_SPAMPROB
+             else:
+                 record.atime = now
+                 prob = record.spamprob
+ 
+             distance = abs(prob - 0.5)
+             if distance > smallest_best:
+                 # Subtle:  ">" is used instead of ">=" not just to save
+                 # calls to heapreplace().  The real intent is that if
+                 # there are many equally strong indicators throughout the
+                 # message, we want to favor the ones that appear earliest:
+                 # it's expected that spam headers will often have smoking
+                 # guns, and, even when not, spam has to grab your attention
+                 # early (& note that when spammers generate large blocks of
+                 # random gibberish to throw off exact-match filters, it's
+                 # always at the end of the msg -- if they put it at the
+                 # start, *nobody* would read the msg).
+                 heapreplace(nbest, (distance, prob, word, record))
+                 smallest_best = nbest[0][0]
+ 
+         # Compute the probability.
+         if evidence:
+             clues = []
+         sp = float(self.nspam) / (self.nham + self.nspam)
+         hp = 1.0 - sp
+         prob_product = sp
+         inverse_prob_product = hp
+         for distance, prob, word, record in nbest:
+             if prob is None:    # it's one of the dummies nbest started with
+                 continue
+             if record is not None:  # else wordinfo doesn't know about it
+                 record.killcount += 1
+             if evidence:
+                 clues.append((word, prob))
+             if self.DEBUG:
+                 print 'nbest P(%r) = %g' % (word, prob)
+             prob_product *= prob / sp
+             inverse_prob_product *= (1.0 - prob) / hp
+ 
+         prob = prob_product / (prob_product + inverse_prob_product)
+         if evidence:
+             return prob, clues
+         else:
+             return prob
+ 
+ 
      def learn(self, wordstream, is_spam, update_probabilities=True):
          """Teach the classifier by example.