Graham's spam filter (was Lisp to Python translation criticism?)

David LeBlanc whisper at oz.net
Wed Aug 21 00:48:05 CEST 2002


> "Erik Max Francis" wrote:
>
> Here's my implementation of Graham's statistical filter in Python.  It's
> based on a Corpus class (a specialized dictionary) that processes data
> (each call of the .process method should be the entire concatenated text
> of a distinct message).  One builds up two corpora [had to look that one
> up!] -- good and bad -- and then hands them to a Database instance,
> which computes the appropriate probability table.  When you want to test
> a new message, create a Corpus for it and then pass it to the database's
> .scan method, which will return the computed probability of the message
> being spam.
>
> I've fiddled around with the code briefly and it doesn't look obviously
> wrong, though I still lack the definitive representative samples of spam
> and nonspam.
>

Looking it over, I wonder whether some optimizations are possible or
desirable. One that came to mind is to retain URLs/URNs as distinct
tokens.
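A rough sketch of what that might look like (the URL pattern and the tokenize() helper are my own, not part of Erik's code): match whole URLs first, then fall back to the ordinary word pattern, so an address survives as a single token.

```python
import re

# Hypothetical tokenizer: try the URL pattern before the word pattern,
# so "http://example.com/buy-now" stays one token instead of being
# shredded into "http", "example", "com", "buy-now".
URL_RE = r"(?:http|https|ftp)://[^\s\"'<>]+"
WORD_RE = r"[a-zA-Z0-9'$_-]+"
TOKEN_RE = re.compile(URL_RE + "|" + WORD_RE)

def tokenize(text):
    return TOKEN_RE.findall(text)
```

The order of the alternation matters: patterns are tried left to right, so with the word pattern first the scheme and host would be split into separate word tokens.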

<snip>
>
> One obvious and immediate issue is that for an industrial-strength
> filter, the database gets _huge_ (Graham's basic setup involved 4000
> messages each in the spam and nonspam corpora), and reading and writing
> the database (even with cPickle) each time a spam message comes through
> starts to become intensive.

I am going to build a version that uses Metakit. It should be good for up to
about 10 MB of messages, if I read the Metakit site right.

One thing I don't see how to do is to add a corpus containing a new message
(good or bad) to the database - i.e. update the database. Maybe
Database.addGood() and Database.addBad()?
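One cheap route (a sketch under names of my own choosing; addGood()/addBad() don't exist in Erik's code) is to merge the new message's token counts into the stored corpus and then let Database.build() recompute the table. The merging step, on a plain token -> count dict, would look like:

```python
def add_to_corpus(corpus, tokens):
    # Merge one message's tokens into a token -> count dict; the
    # per-corpus message count (Corpus.count in the posted code)
    # would be bumped separately by the caller.
    for token in tokens:
        corpus[token] = corpus.get(token, 0) + 1
    return corpus
```

Database.addGood() would then be little more than this merge followed by a rebuild, which costs O(vocabulary) per message; keeping running good/bad counts per token would make the update O(message length) instead.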

> --
>  Erik Max Francis / max at alcyone.com / http://www.alcyone.com/max/
>  __ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
> /  \ There is nothing so subject to the inconstancy of fortune as war.
> \__/ Miguel de Cervantes
>     Church / http://www.alcyone.com/pyos/church/
>  A lambda calculus explorer in Python.
> --

Here's a version of your code with the minor enhancement of a primitive
driver.

import re

TOKEN_RE = re.compile(r"[a-zA-Z0-9'$_-]+")
FREQUENCY_THRESHHOLD = 5
GOOD_BIAS = 2.0
BAD_BIAS = 1.0
GOOD_PROB = 0.01
BAD_PROB = 0.99
UNKNOWN_PROB = 0.2

class Corpus(dict):
    def __init__(self, data=None):
        dict.__init__(self)
        self.count = 0
        if data is not None:
            self.process(data)

    def process(self, data):
        tokens = TOKEN_RE.findall(data)
        for token in tokens:
            if self.has_key(token):
                self[token] += 1
            else:
                self[token] = 1
        self.count += 1

class Database(dict):
    def __init__(self, good, bad):
        dict.__init__(self)
        self.build(good, bad)

    def build(self, good, bad):
        ngood = good.count
        nbad = bad.count
        for token in good.keys() + bad.keys(): # doubles up, but works
            if not self.has_key(token):
                g = GOOD_BIAS*good.get(token, 0)
                b = BAD_BIAS*bad.get(token, 0)
                if g + b >= FREQUENCY_THRESHHOLD:
                    goodMetric = min(1.0, g/ngood)
                    badMetric = min(1.0, b/nbad)
                    total = goodMetric + badMetric
                    prob = max(GOOD_PROB, min(BAD_PROB,badMetric/total))
                    self[token] = prob

    def scan(self, corpus):
        pairs = [(token, self.get(token, UNKNOWN_PROB))
                 for token in corpus.keys()]
        pairs.sort(lambda x, y: cmp(abs(y[1] - 0.5), abs(x[1] - 0.5)))
        significant = pairs[:15]
        inverseProduct = product = 1.0
        for token, prob in significant:
            product *= prob
            inverseProduct *= 1.0 - prob
        return product/(product + inverseProduct)

if __name__ == '__main__':
    import os, string

    # Build reference corpus
    bad = Corpus()
    good = Corpus()

    # Put your own path to spam/bad files here.
    # 1 message per .txt file is assumed. Messages include headers.
    os.chdir("l:/apps/python/email/grafilt/spam")
    names = os.listdir(os.curdir)
    for name in names:
        print "bad file: ", name
        fn = open(name)
        txt = fn.readlines()
        fn.close()
        bad.process("".join(txt))

    # Put your own path to good files here.
    # 1 message per .txt file is assumed. Messages include headers.
    os.chdir("l:/apps/python/email/grafilt/good")
    names = os.listdir(os.curdir)
    for name in names:
        print "good file: ", name
        fn = open(name)
        txt = fn.readlines()
        fn.close()
        good.process("".join(txt))

    # Reference corpus
    db = Database(good,bad)

    for key in db.keys():
        #print "key: %s value: %s" % (key, db[key])
        pass

    # Test message
    fn = open("l:/apps/python/email/grafilt/good/ZootForum Re Read only Mode.txt")
    txt = fn.readlines()
    fn.close()

    test = Corpus("".join(txt))
    print "%.3f" % db.scan(test)

---------------------------------------------------------

With a known good message, I keep getting 0.0000... from Database.scan(),
and I don't know if that's correct. With a known spam file I get 1.0.
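For what it's worth, that saturation looks like what Graham's combining rule does by design rather than a bug: scan() returns prod(p) / (prod(p) + prod(1-p)) over the fifteen most extreme tokens, and even mild agreement among fifteen probabilities pins the result to one end. A quick check of the arithmetic (combine() just restates the last lines of scan()):

```python
def combine(probs):
    # Same combination as Database.scan:
    # prod(p) / (prod(p) + prod(1 - p))
    product = inverse = 1.0
    for p in probs:
        product *= p
        inverse *= 1.0 - p
    return product / (product + inverse)

# Fifteen tokens that are each only 70% "hammy" already drive the
# spam probability below 0.001%.
print("%.6f" % combine([0.3] * 15))   # -> 0.000003
```

So 0.0000... for known-good mail and 1.0 for known spam are plausible outputs; the interesting cases are borderline messages whose tokens disagree.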

Regards,

Dave LeBlanc
Seattle, WA USA




