Graham's spam filter (was Lisp to Python translation criticism?)
whisper at oz.net
Wed Aug 21 06:10:30 CEST 2002
Many good points...
Creating all those initial categories might make the filter work even better,
it's true, but it depends on the user being willing to spend the time to
hand-sort some minimum number of messages per category to seed the filter.
I also wonder if it's needed. Graham's site claimed 5 missed (presumably
passed through as good) spams per thousand rejects and no false positives
(legitimate mail claimed as spam).
I think that after this step of processing mail (using this filter), it
might be feasible to re-evaluate each message against a table generated per
folder that filters for messages fitting that folder. I noticed this evening
that this is how ifile works, and it uses a form of the same Bayesian filter.
Seattle, WA USA
> -----Original Message-----
> From: python-list-admin at python.org
> [mailto:python-list-admin at python.org]On Behalf Of Christopher Browne
> Sent: Tuesday, August 20, 2002 20:10
> To: python-list at python.org
> Subject: Re: Graham's spam filter (was Lisp to Python translation
> Oops! "David LeBlanc" <whisper at oz.net> was seen spray-painting on a wall:
> >> -----Original Message-----
> >> From: python-list-admin at python.org
> >> [mailto:python-list-admin at python.org]On Behalf Of Christopher Browne
> >> Sent: Tuesday, August 20, 2002 17:15
> >> To: python-list at python.org
> >> Subject: Re: Graham's spam filter (was Lisp to Python translation
> >> criticism?)
> > <snip>
> >> I'd suggest the thought of doing message header associations as
> >> tokens, so that you might get, out of:
> >> Subject: Re: Graham's spam filter (was Lisp to Python
> >> translation criticism?)
> >> the set of tokens:
> >> subject::re
> >> subject::graham's
> > <snip>
> >> subject::Python
> >> Then do something similar with .signature material:
> >> signature::a
> >> signature::ago
> >> signature::been
> > <snip>
> > What's the advantage of this?
> The advantage is that it discriminates between words in the header,
> words in the body, and words in the .signature.
> The whole point of the exercise is to do discrimination; the more
> useful criteria there are, the better.
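
A minimal sketch of the prefixed-token idea quoted above, in Python. The
function names, the token regex, the lowercasing, and the use of the
conventional "-- " line to find the .signature are all assumptions made for
illustration; none of them come from Graham's filter or from the code under
discussion.

```python
import re

# Assumed token pattern: runs of letters, digits, apostrophes, '$' and '-'.
TOKEN_RE = re.compile(r"[A-Za-z0-9'$-]+")


def split_signature(text):
    """Split a message body at the conventional '-- ' signature line.

    Returns (body, signature); signature is '' if no delimiter is found.
    """
    body, _sep, signature = text.partition("\n-- \n")
    return body, signature


def tagged_tokens(subject, body, signature=""):
    """Return tokens prefixed with the part of the message they came from."""
    tokens = []
    sections = (("subject", subject), ("body", body), ("signature", signature))
    for section, text in sections:
        for word in TOKEN_RE.findall(text):
            # Lowercasing is an assumption; keep case if it helps discriminate.
            tokens.append(f"{section}::{word.lower()}")
    return tokens


if __name__ == "__main__":
    body, sig = split_signature("Many good points...\n-- \nSeattle, WA USA")
    for token in tagged_tokens("Re: Graham's spam filter", body, sig):
        print(token)
```

With this, "spam" in a Subject: line and "spam" in a .signature become two
different tokens, so each gets its own statistics.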
> > <snip>
> >> > One thing I don't see how to do is to add a corpus containing a new
> >> > message (good or bad) to the database - i.e. update the
> >> > database. Maybe Database.addGood() and Database.addBad()?
> >> It works a whopping lot better if there's a whopping lot more than
> >> just two categories...
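
For the database-update question above, here is a rough sketch of what
incremental updates could look like. This Database class is a hypothetical
stand-in, not the class from the translation being discussed; addGood() and
addBad() are the names floated in the quoted text, and add() with an arbitrary
category name is an assumption added to allow for more than two categories.

```python
from collections import Counter, defaultdict


class Database:
    """Hypothetical per-category token counts that can be updated in place."""

    def __init__(self):
        # category name -> Counter mapping token -> occurrences seen so far
        self.token_counts = defaultdict(Counter)
        # category name -> number of messages added to that category
        self.message_counts = Counter()

    def add(self, category, tokens):
        """Fold one tokenized message into the counts for `category`."""
        # Assumption: every occurrence of a token counts, not just the first.
        self.token_counts[category].update(tokens)
        self.message_counts[category] += 1

    def addGood(self, tokens):
        self.add("good", tokens)

    def addBad(self, tokens):
        self.add("bad", tokens)


if __name__ == "__main__":
    db = Database()
    db.addGood(["python", "list", "python"])
    db.addBad(["casino", "bonus"])
    db.add("nigerian-scam", ["urgent", "funds", "transfer"])
    print(dict(db.message_counts))
```

Since only counts are stored, adding a new message never requires rescanning
the old corpus; any probabilities can be recomputed from the counts on demand.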
> > I agree that a complete mail program should have the ability to sort
> > mail into many categories and this phase of the operation is not
> > where to do it. This is a pass/fail filtration step, not a sort
> > step.
> Then you are essentially seeking to have your system try to have two
> -> What does the "average good email" look like, and
> -> What does the "average bad email" look like.
> Since both of those characterize large "clouds" of entries, where, for
> -> "Good" email includes notes from friends, notes from technical
> associates, and such, which have varying characteristics;
> -> "Bad" email, where some have lots of "Nigerian Scam" words,
> and others talk a lot about casinos, breast enlargement,
> alternatives to Viagra, where to buy mailing lists, and such.
> If you merge the categories together, what you get is a cloudy sort of
> Suppose a projection of relevance values onto the vector space of
> messages looks something like:
> [ASCII scatter diagram; the original layout is lost. The labeled points
> appear to be: Mail from Mom, Python Lists, Nigerian Scams, Snakeoil,
> Casinos, School Alumni, Credit, Brothers, DBMS Discussion. The two labeled
> centroids are: Spam Centroid, Good Mail Centroid.]
> (I'm pretending it makes sense to project this onto two dimensions.
> In a sense, there's a dimension for each word that is considered, so that
> if there are 30000 words in your dictionary, there's a _PILE_ of
> dimensions.)
> If everything gets "averaged," then what you have are two categories,
> "good" and "bad," and whether something's "good" or "bad" depends on
> how close its value lies to the appropriate centroid. (Two of them
> being labeled.)
> If you have a whole whack of categories, it means you're looking at
> nearness not merely to two "centroids," but rather to whichever centroid
> is nearest. Note that the "cloud" around the 'Good Mail
> Centroid' is rather large. In fact, in this diagram, mail from
> schoolmates may wind up looking as if it should be categorized as
> I arbitrarily chose that; the point is that the simple "good versus
> bad" is something of an oversimplification. You've got a lot of
> statistics, and you're not using them all.
> I would _definitely_ argue that having several spam folders to choose
> from should be helpful, as it allows taking advantage of the fact that
> (for instance) African Financial Scams have _really_ similar
> characteristics, and you can be _really_ confident that you've got a
> Nigerian Pyramid Scam. That gives _greater_ certainty of appropriate
> message classifications.
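
The "nearest centroid" picture described above can be sketched roughly as
follows. This is an illustration of that picture only, not Graham's method and
not how ifile is implemented. The category names, the toy training data, and
the choice of cosine similarity as the measure of "nearness" are assumptions.

```python
import math
from collections import Counter


def centroid(messages):
    """Average the token-count vectors of a list of tokenized messages."""
    if not messages:
        return {}
    total = Counter()
    for tokens in messages:
        total.update(tokens)
    return {token: count / len(messages) for token, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(token, 0.0) for token, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_category(tokens, centroids):
    """Return the name of the category whose centroid is most similar."""
    vector = Counter(tokens)
    return max(centroids, key=lambda name: cosine(vector, centroids[name]))


if __name__ == "__main__":
    # Toy training data, invented purely for illustration.
    training = {
        "nigerian-scam": [["urgent", "funds", "transfer"],
                          ["funds", "bank", "transfer"]],
        "casino": [["casino", "bonus", "win"]],
        "python-list": [["python", "list", "filter"]],
    }
    centroids = {name: centroid(msgs) for name, msgs in training.items()}
    print(nearest_category(["transfer", "funds", "now"], centroids))
```

With one centroid per folder instead of one "good" and one "bad" average, a
message only has to resemble one tight cluster rather than sit near the middle
of a large, diffuse cloud.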
> (reverse (concatenate 'string "gro.mca@" "enworbbc"))
> "The Amiga is proof that if you build a better mousetrap, the rats
> will gang up on you." -- Bill Roberts bill.roberts at ensco.com