[Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.9,1.10

Thu, 22 Aug 2002 20:10:44 -0700

Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3157

Modified Files:
	GBayes.py 
Log Message:
spamprob():  Commented some subtleties.

clearjunk():  Undid Guido's attempt to space-optimize this.  The problem
is that you can't delete entries from a dict while it's being crawled
over by .iteritems(), which is why (I suddenly recall) I materialized a
list of words to be deleted the first time I wrote this.  Materializing
a list of to-be-deleted words is still a lot better than materializing
the entire database in a dict.items() list.
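
For readers who want the clearjunk() point in isolation, here is a minimal
sketch of the two approaches.  It is written for Python 3, where .items()
plays the role that .iteritems() plays in the diff below; the function names
and the is_junk predicate are hypothetical stand-ins, not part of GBayes.py.

```python
def clearjunk_sketch(wordinfo, is_junk):
    """Delete every entry whose record satisfies is_junk; return the keys removed."""
    # Collect the doomed keys first.  This list holds only the words to be
    # deleted, which is normally far smaller than a copy of the whole dict.
    tonuke = [w for w, r in wordinfo.items() if is_junk(r)]
    for w in tonuke:
        del wordinfo[w]
    return tonuke


def unsafe_clearjunk_sketch(wordinfo, is_junk):
    """The tempting version: deletes while the dict is still being iterated."""
    for w, r in wordinfo.items():
        if is_junk(r):
            # Changing the dict's size here makes the next iteration step
            # raise RuntimeError.
            del wordinfo[w]
```

The second function fails as soon as it deletes an entry, which is the
failure the log message describes.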

Index: GBayes.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/GBayes.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** GBayes.py	21 Aug 2002 21:01:47 -0000	1.9
--- GBayes.py	23 Aug 2002 03:10:42 -0000	1.10
***************
*** 160,166 ****
--- 160,183 ----
              distance = abs(prob - 0.5)
              if distance > smallest_best:
+                 # Subtle:  we didn't use ">" instead of ">=" just to save
+                 # calls to heapreplace().  The real intent is that if
+                 # there are many equally strong indicators throughout the
+                 # message, we want to favor the ones that appear earliest:
+                 # it's expected that spam headers will often have smoking
+                 # guns, and, even when not, spam has to grab your attention
+                 # early (& note that when spammers generate large blocks of
+                 # random gibberish to throw off exact-match filters, it's
+                 # always at the end of the msg -- if they put it at the
+                 # start, *nobody* would read the msg).
                  heapreplace(nbest, (distance, prob, word, record))
                  smallest_best = nbest[0][0]

+         # Compute the probability.  Note:  This is what Graham's code did,
+         # but it's dubious for reasons explained in great detail on Python-
+         # Dev:  it's missing P(spam) and P(not-spam) adjustments that
+         # straightforward Bayesian analysis says should be here.  It's
+         # unclear how much it matters, though, as the omissions here seem
+         # to tend in part to cancel out distortions introduced earlier by
+         # HAMBIAS.  Experiments will decide the issue.
          prob_product = inverse_prob_product = 1.0
          for distance, prob, word, record in nbest:
***************
*** 254,263 ****
          wordinfo = self.wordinfo
          mincount = float(mincount)
!         for w, r in wordinfo.iteritems():
!             if (r.atime < oldesttime and
!                 SPAMBIAS*r.spamcount + HAMBIAS*r.hamcount < mincount):
!                 if self.DEBUG:
!                     print "clearjunk removing word %r: %r" % (w, r)
!                 del wordinfo[w]

      def _add_msg(self, wordstream, is_spam):
--- 271,281 ----
          wordinfo = self.wordinfo
          mincount = float(mincount)
!         tonuke = [w for w, r in wordinfo.iteritems()
!                     if r.atime < oldesttime and
!                        SPAMBIAS*r.spamcount + HAMBIAS*r.hamcount < mincount]
!         for w in tonuke:
!             if self.DEBUG:
!                 print "clearjunk removing word %r: %r" % (w, wordinfo[w])
!             del wordinfo[w]

      def _add_msg(self, wordstream, is_spam):
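
The two spamprob() comments in the first hunk can be illustrated with a
small standalone sketch.  This is not the GBayes.py code: the function names,
the (word, prob) input format, and the max_clues parameter are assumptions
made for illustration, and the code targets Python 3.  It shows how a strict
">" test keeps earlier clues when a later one merely ties the weakest kept
clue, and how the kept probabilities are combined without any P(spam) or
P(not-spam) adjustment.

```python
from heapq import heappush, heapreplace


def strongest_clues(word_probs, max_clues):
    """Keep the max_clues entries farthest from 0.5 (assumes max_clues >= 1)."""
    nbest = []  # min-heap of (distance, prob, word); nbest[0] is the weakest kept
    for word, prob in word_probs:
        distance = abs(prob - 0.5)
        if len(nbest) < max_clues:
            heappush(nbest, (distance, prob, word))
        elif distance > nbest[0][0]:
            # Strict ">": a later clue that only ties the weakest kept clue
            # does not displace it, so earlier clues win ties.
            heapreplace(nbest, (distance, prob, word))
    return nbest


def combine(probs):
    """Combine per-word probabilities as in the diff: product over the sum of
    the product and the product of complements.  Assumes the inputs are not
    a mix of exact 0.0 and exact 1.0 values, which would divide by zero."""
    prob_product = inverse_prob_product = 1.0
    for prob in probs:
        prob_product *= prob
        inverse_prob_product *= 1.0 - prob
    return prob_product / (prob_product + inverse_prob_product)
```

With three equally strong clues and room for two, the first two are kept and
the third is ignored; switching the test to ">=" would instead let later
clues push earlier ones out.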