[Python-checkins] python/nondist/sandbox/spambayes GBayes.py,1.9,1.10
tim_one@users.sourceforge.net
Thu, 22 Aug 2002 20:10:44 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv3157
Modified Files:
GBayes.py
Log Message:
spamprob(): Commented some subtleties.
clearjunk(): Undid Guido's attempt to space-optimize this. The problem
is that you can't delete entries from a dict that's being crawled over
by .iteritems(), which is why I (I suddenly recall) materialized a
list of words to be deleted the first time I wrote this. It's a lot
better to materialize a list of to-be-deleted words than to materialize
the entire database in a dict.items() list.
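The pitfall the log message describes can be seen in a small, self-contained sketch (modern Python 3 shown here, where mutating a dict during iteration raises RuntimeError outright; Python 2's .iteritems() had the same restriction. The dict contents and MINCOUNT below are made-up illustrations, not from GBayes.py):

```python
# Sketch of the mutate-while-iterating pitfall, and the fix the log
# message describes: materialize only the keys to be deleted, rather
# than the entire database, then delete in a second pass.
wordinfo = {"cheap": 1, "meeting": 7, "viagra": 2, "lunch": 9}
MINCOUNT = 5

# Deleting inside the iteration fails: the dict changes size while
# the iterator is live, and the next step raises RuntimeError.
try:
    for w, count in wordinfo.items():
        if count < MINCOUNT:
            del wordinfo[w]
except RuntimeError:
    pass  # "cheap" was already deleted before the iterator noticed

# The fix: build a (small) list of doomed keys first, then delete.
tonuke = [w for w, count in wordinfo.items() if count < MINCOUNT]
for w in tonuke:
    del wordinfo[w]

print(sorted(wordinfo))  # only the frequent words survive
```

Materializing `tonuke` costs memory proportional to the number of words being pruned, which is normally far smaller than a full `dict.items()` snapshot of the database.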
Index: GBayes.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/GBayes.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** GBayes.py 21 Aug 2002 21:01:47 -0000 1.9
--- GBayes.py 23 Aug 2002 03:10:42 -0000 1.10
***************
*** 160,166 ****
--- 160,183 ----
distance = abs(prob - 0.5)
if distance > smallest_best:
+ # Subtle: we didn't use ">" instead of ">=" just to save
+ # calls to heapreplace(). The real intent is that if
+ # there are many equally strong indicators throughout the
+ # message, we want to favor the ones that appear earliest:
+ # it's expected that spam headers will often have smoking
+ # guns, and, even when not, spam has to grab your attention
+ # early (& note that when spammers generate large blocks of
+ # random gibberish to throw off exact-match filters, it's
+ # always at the end of the msg -- if they put it at the
+ # start, *nobody* would read the msg).
heapreplace(nbest, (distance, prob, word, record))
smallest_best = nbest[0][0]
+ # Compute the probability. Note: This is what Graham's code did,
+ # but it's dubious for reasons explained in great detail on Python-
+ # Dev: it's missing P(spam) and P(not-spam) adjustments that
+ # straightforward Bayesian analysis says should be here. It's
+ # unclear how much it matters, though, as the omissions here seem
+ # to tend in part to cancel out distortions introduced earlier by
+ # HAMBIAS. Experiments will decide the issue.
prob_product = inverse_prob_product = 1.0
for distance, prob, word, record in nbest:
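The two steps this hunk comments on can be sketched in miniature: keep the N indicators whose probabilities lie farthest from 0.5 on a fixed-size min-heap, where strict ">" lets earlier words win ties, then combine the survivors with Graham's formula. This is an illustrative sketch, not the GBayes.py source: the word-probability table, N, and the dummy heap seed are made up, and alphabetical iteration merely stands in for message order.

```python
import heapq

# Made-up per-word spam probabilities (0.5 = neutral).
probs = {"money": 0.91, "meeting": 0.11, "free": 0.90,
         "lunch": 0.12, "click": 0.97}

N = 3
# Seed the heap with N comparable dummy entries so heapreplace() can
# be used unconditionally; nbest[0] is always the weakest survivor.
nbest = [(-1.0, 0.0, "")] * N
heapq.heapify(nbest)
smallest_best = nbest[0][0]
for word, prob in sorted(probs.items()):  # stand-in for message order
    distance = abs(prob - 0.5)
    # Strict ">": a later word whose distance merely *ties* the weakest
    # survivor does not displace it -- earlier indicators are favored.
    if distance > smallest_best:
        heapq.heapreplace(nbest, (distance, prob, word))
        smallest_best = nbest[0][0]

# Graham's combining step: multiply the survivors' probabilities and
# their complements, then normalize.  (As the checkin comment notes,
# this omits the P(spam)/P(not-spam) priors a full Bayesian treatment
# would include.)
prob_product = inverse_prob_product = 1.0
for distance, prob, word in nbest:
    if not word:
        continue  # unreplaced dummy seed
    prob_product *= prob
    inverse_prob_product *= 1.0 - prob

spamprob = prob_product / (prob_product + inverse_prob_product)
```

With this table the three strongest indicators ("click", "money", "free") all lean heavily toward spam, so `spamprob` lands very close to 1.0.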
***************
*** 254,263 ****
wordinfo = self.wordinfo
mincount = float(mincount)
! for w, r in wordinfo.iteritems():
! if (r.atime < oldesttime and
! SPAMBIAS*r.spamcount + HAMBIAS*r.hamcount < mincount):
! if self.DEBUG:
! print "clearjunk removing word %r: %r" % (w, r)
! del wordinfo[w]
def _add_msg(self, wordstream, is_spam):
--- 271,281 ----
wordinfo = self.wordinfo
mincount = float(mincount)
! tonuke = [(w, r) for w, r in wordinfo.iteritems()
! if r.atime < oldesttime and
! SPAMBIAS*r.spamcount + HAMBIAS*r.hamcount < mincount]
! for w, r in tonuke:
! if self.DEBUG:
! print "clearjunk removing word %r: %r" % (w, r)
! del wordinfo[w]
def _add_msg(self, wordstream, is_spam):