[Python-checkins] python/nondist/sandbox/spambayes classifier.py,1.7,1.8
tim_one@users.sourceforge.net
Mon, 02 Sep 2002 13:57:48 -0700
Update of /cvsroot/python/python/nondist/sandbox/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv7837
Modified Files:
classifier.py
Log Message:
Added a new xspamprob() method, which computes the combined probability
"correctly", and a long comment block explaining what happened when I
tried it. There's something worth pursuing here (it greatly improves
the false negative rate), but this change alone pushes too many marginal
hams into the spam camp.
Index: classifier.py
===================================================================
RCS file: /cvsroot/python/python/nondist/sandbox/spambayes/classifier.py,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** classifier.py 1 Sep 2002 07:22:27 -0000 1.7
--- classifier.py 2 Sep 2002 20:57:46 -0000 1.8
***************
*** 206,209 ****
--- 206,384 ----
return prob
+ # The same as spamprob(), except it uses a corrected probability computation
+ # accounting for P(spam) and P(not-spam). Since my training corpora had
+ # a ham/spam ratio of 4000/2750, I'm in a good position to test this.
+ # Using xspamprob() clearly made a major reduction in the false negative
+ # rate, cutting it in half on some runs (this is after the f-n rate had
+ # already been cut by a factor of 5 via other refinements). It also
+ # uncovered two more very brief spams hiding in the ham corpora.
+ #
+ # OTOH, the # of fps increased. Especially vulnerable are extremely
+ # short msgs of the "subscribe me"/"unsubscribe me" variety (while these
+ # don't belong on a mailing list, they're not spam), and brief reasonable
+ # msgs that simply don't have much evidence (to the human eye) to go on.
+ # These were borderline before, and it's easy to push them over the edge.
+ # For example, one f-p had subject
+ #
+ # Any Interest in EDIFACT Parser/Generator?
+ #
+ # and the body just
+ #
+ # Just curious.
+ # --jim
+ #
+ # "Interest" in the subject line had spam prob 0.99, "curious." 0.01,
+ # and nothing else was strong. Since my ham/spam ratio is bigger than
+ # 1, any clue favoring spam favors spam more strongly under xspamprob()
+ # than under spamprob().
+ #
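+ # In symbols, with sp = nspam/(nham+nspam) and hp = 1 - sp, the
+ # combining step below computes
+ #
+ #               sp * prod(p_i/sp)
+ #     -----------------------------------------
+ #     sp * prod(p_i/sp) + hp * prod((1-p_i)/hp)
+ #
+ # over the strongest clue probabilities p_i; when sp == hp == 0.5 this
+ # reduces to spamprob()'s prod(p) / (prod(p) + prod(1-p)).
+ #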
+ # XXX Somewhat like spamprob(), learn() also computes probabilities as
+ # XXX if the # of hams and spams were the same. If that were also
+ # XXX fiddled to take nham and nspam into account (nb: I realize it
+ # XXX already *looks* like it does -- but it doesn't), it would reduce
+ # XXX the spam probabilities in my test run, and *perhaps* xspamprob
+ # XXX wouldn't have such a bad effect on the f-p story.
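+ # XXX
+ # XXX One plausible fiddle (a sketch only -- not implemented in this
+ # XXX checkin): build each word's prob from per-corpus ratios instead
+ # XXX of raw counts, roughly (field names illustrative)
+ # XXX
+ # XXX     sr = float(record.spamcount) / nspam
+ # XXX     hr = float(record.hamcount) / nham
+ # XXX     prob = sr / (sr + hr)
+ # XXX
+ # XXX so 10 sightings among 2750 spams count for more than 10 among
+ # XXX 4000 hams.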
+ #
+ # Here are the comparative stats, with spamprob() in the left column and
+ # xspamprob() in the right, across 20 runs:
+ #
+ # false positive percentages
+ # 0.000 0.000 tied
+ # 0.000 0.050 lost
+ # 0.050 0.100 lost
+ # 0.000 0.075 lost
+ # 0.025 0.050 lost
+ # 0.025 0.100 lost
+ # 0.050 0.150 lost
+ # 0.025 0.050 lost
+ # 0.025 0.050 lost
+ # 0.000 0.050 lost
+ # 0.075 0.150 lost
+ # 0.050 0.075 lost
+ # 0.025 0.050 lost
+ # 0.000 0.050 lost
+ # 0.050 0.125 lost
+ # 0.025 0.075 lost
+ # 0.025 0.025 tied
+ # 0.000 0.025 lost
+ # 0.025 0.100 lost
+ # 0.050 0.150 lost
+ #
+ # won 0 times
+ # tied 2 times
+ # lost 18 times
+ #
+ # total unique fp went from 8 to 30
+ #
+ # false negative percentages
+ # 0.945 0.473 won
+ # 0.836 0.582 won
+ # 1.200 0.618 won
+ # 1.418 0.836 won
+ # 1.455 0.836 won
+ # 1.091 0.691 won
+ # 1.091 0.618 won
+ # 1.236 0.691 won
+ # 1.564 1.018 won
+ # 1.236 0.618 won
+ # 1.563 0.981 won
+ # 1.563 0.800 won
+ # 1.236 0.618 won
+ # 0.836 0.400 won
+ # 0.873 0.400 won
+ # 1.236 0.545 won
+ # 1.273 0.691 won
+ # 1.018 0.327 won
+ # 1.091 0.473 won
+ # 1.490 0.618 won
+ #
+ # won 20 times
+ # tied 0 times
+ # lost 0 times
+ #
+ # total unique fn went from 292 to 162
+
+     def xspamprob(self, wordstream, evidence=False):
+         """Return best-guess probability that wordstream is spam.
+
+         wordstream is an iterable object producing words.
+         The return value is a float in [0.0, 1.0].
+
+         If optional arg evidence is True, the return value is a pair
+             probability, evidence
+         where evidence is a list of (word, probability) pairs.
+         """
+
+         if self.DEBUG:
+             print "xspamprob(%r)" % wordstream
+
+         # A priority queue to remember the MAX_DISCRIMINATORS best
+         # probabilities, where "best" means largest distance from 0.5.
+         # The tuples are (distance, prob, word, wordinfo[word]).
+         nbest = [(-1.0, None, None, None)] * MAX_DISCRIMINATORS
+         smallest_best = -1.0
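+         # (Any real distance is >= 0.0, so the -1.0 dummies lose first;
+         # dummies that survive a short msg are skipped below via the
+         # "prob is None" test.)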
+
+         # Counting a unique word multiple times hurts, although counting
+         # one at most two times had some benefit when UNKNOWN_SPAMPROB
+         # was 0.2.  When that got boosted to 0.5, counting more than once
+         # became counterproductive.
+         unique_words = {}
+
+         wordinfoget = self.wordinfo.get
+         now = time.time()
+
+         for word in wordstream:
+             if word in unique_words:
+                 continue
+             unique_words[word] = 1
+
+             record = wordinfoget(word)
+             if record is None:
+                 prob = UNKNOWN_SPAMPROB
+             else:
+                 record.atime = now
+                 prob = record.spamprob
+
+             distance = abs(prob - 0.5)
+             if distance > smallest_best:
+                 # Subtle: ">" is used instead of ">=" not just to save
+                 # calls to heapreplace().  The real intent is that if
+                 # there are many equally strong indicators throughout the
+                 # message, we want to favor the ones that appear earliest:
+                 # it's expected that spam headers will often have smoking
+                 # guns, and, even when not, spam has to grab your attention
+                 # early (& note that when spammers generate large blocks of
+                 # random gibberish to throw off exact-match filters, it's
+                 # always at the end of the msg -- if they put it at the
+                 # start, *nobody* would read the msg).
+                 heapreplace(nbest, (distance, prob, word, record))
+                 smallest_best = nbest[0][0]
+
+         # Compute the probability.
+         if evidence:
+             clues = []
+         sp = float(self.nspam) / (self.nham + self.nspam)
+         hp = 1.0 - sp
+         prob_product = sp
+         inverse_prob_product = hp
+         for distance, prob, word, record in nbest:
+             if prob is None:    # it's one of the dummies nbest started with
+                 continue
+             if record is not None:  # else wordinfo doesn't know about it
+                 record.killcount += 1
+             if evidence:
+                 clues.append((word, prob))
+             if self.DEBUG:
+                 print 'nbest P(%r) = %g' % (word, prob)
+             prob_product *= prob / sp
+             inverse_prob_product *= (1.0 - prob) / hp
+
+         prob = prob_product / (prob_product + inverse_prob_product)
+         if evidence:
+             return prob, clues
+         else:
+             return prob
+
+
    def learn(self, wordstream, is_spam, update_probabilities=True):
        """Teach the classifier by example.