Graham's spam filter (was Lisp to Python translation criticism?)

Wed Aug 21 00:10:30 EDT 2002

Many good points...

Creating all those initial categories might well make the filter work even
better, but it depends on the user being willing to spend the time
hand-sorting some minimum number of messages per category to seed the filter.

I also wonder if it's needed. Graham's site claims 5 missed spams
(presumably passed through as good) per thousand rejects, and no false
positives (legitimate mail flagged as spam).

I think that after this step of processing mail (using this filter), it
might be feasible to reevaluate each message against a table generated per
folder that scores how well the message fits that folder. I noticed this
evening that this is how ifile works, and it uses a form of the same Bayesian
filtering as Graham.
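
A minimal sketch of that two-pass idea in Python, for what it's worth. It is
not ifile's code and not Graham's: FolderClassifier, route() and the is_spam
callback are names made up for illustration, and the per-folder scoring is a
plain naive Bayes with add-one smoothing, which I'm assuming is close enough
to show the shape of the thing.

```python
import math
from collections import Counter, defaultdict


class FolderClassifier:
    """Toy per-folder naive Bayes (hypothetical helper, not ifile's code)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.msg_counts = Counter()              # folder -> messages seen

    def train(self, folder, text):
        """Seed a folder with one hand-sorted message."""
        self.word_counts[folder].update(text.lower().split())
        self.msg_counts[folder] += 1

    def classify(self, text):
        """Return the folder with the highest log-probability, or None."""
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_msgs = sum(self.msg_counts.values())
        best_folder, best_score = None, -math.inf
        for folder, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Prior: fraction of training messages filed in this folder.
            score = math.log(self.msg_counts[folder] / total_msgs)
            for word in text.lower().split():
                # Add-one smoothing so unseen words don't zero the score.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder


def route(message, is_spam, folders):
    """Pass 1: pass/fail spam check. Pass 2: pick a folder for what passed."""
    if is_spam(message):
        return "spam"
    return folders.classify(message)
```

The point of keeping the two passes separate is that the spam check can stay
a simple good/bad decision, while the folder tables only ever see mail that
already passed it.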

Regards,

David LeBlanc
Seattle, WA USA

> -----Original Message-----
> From: python-list-admin at python.org
> [mailto:python-list-admin at python.org]On Behalf Of Christopher Browne
> Sent: Tuesday, August 20, 2002 20:10
> To: python-list at python.org
> Subject: Re: Graham's spam filter (was Lisp to Python translation
> criticism?)
>
>
> Oops! "David LeBlanc" <whisper at oz.net> was seen spray-painting on a wall:
> >> -----Original Message-----
> >> From: python-list-admin at python.org
> >> [mailto:python-list-admin at python.org]On Behalf Of Christopher Browne
> >> Sent: Tuesday, August 20, 2002 17:15
> >> To: python-list at python.org
> >> Subject: Re: Graham's spam filter (was Lisp to Python translation
> >> criticism?)
> >>
> >>
> > <snip>
> >> I'd suggest the thought of doing message header associations as
> >> tokens, so that you might get, out of:
> >>
> >>   Subject: Re: Graham's spam filter (was Lisp to Python
> >> translation criticism?)
> >>
> >> the set of tokens:
> >> subject::re
> >> subject::graham's
> > <snip>
> >> subject::Python
> >>
> >> Then do something similar with .signature material:
> >>
> >> signature::a
> >> signature::ago
> >> signature::been
> > <snip>
> >
> > What's the advantage of this?
>
> The advantage is that it discriminates between words in the header,
> words in the body, and words in the .signature.
>
> The whole point of the exercise is to do discrimination; the more
> useful criteria there are, the better.
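
A rough sketch of that kind of tagging in Python, using the standard email
module. The function name tagged_tokens, the "body::" prefix, the lowercasing,
the punctuation stripping, and the use of a "-- " line to find the signature
are all assumptions made for illustration; only the "subject::" and
"signature::" prefixes come from the suggestion above.

```python
import email
import string


def _words(text):
    """Split on whitespace, strip surrounding punctuation, lowercase."""
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:
            yield word


def tagged_tokens(raw_message):
    """Return tokens prefixed by the part of the message they came from."""
    msg = email.message_from_string(raw_message)
    tokens = ["subject::" + w for w in _words(msg["Subject"] or "")]

    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart messages are ignored in this sketch.
        body = ""

    # Assumption: the signature starts after a line holding only "-- ".
    text, _, signature = body.partition("\n-- \n")
    tokens += ["body::" + w for w in _words(text)]
    tokens += ["signature::" + w for w in _words(signature)]
    return tokens
```

With this, "python" in a subject line and "python" in a signature become two
different tokens, so the statistics for each are kept apart.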
>
> > <snip>
> >
> >> > One thing I don't see how to do is to add a corpus containing a new
> >> > message (good or bad) to the database - i.e. update the
> >> > database. Maybe Database.addGood() and Database.addBad()?
> >>
> >> It works a whopping lot better if there's a whopping lot more than
> >> just two categories...
> >
> > I agree that a complete mail program should have the ability to sort
> > mail into many categories and this phase of the operation is not
> > where to do it.  This is a pass/fail filtration step, not a sort
> > step.
>
> Then you are essentially asking your system to estimate just two
> things:
>
>   -> What does the "average good email" look like, and
>   -> What does the "average bad email" look like.
>
> Both of those characterize large "clouds" of entries. For
> instance:
>
>   -> "Good" email includes notes from friends, notes from technical
>       associates, and such, which have varying characteristics;
>
>   -> "Bad" email includes some with lots of "Nigerian Scam" words,
>       and others that talk a lot about casinos, breast enlargement,
>       alternatives to Viagra, where to buy mailing lists, and such.
>
> If you merge the categories together, what you get is a cloudy sort of
> "average."
>
> Suppose a projection of relevance values onto the vector space of
> messages looks something like:
>
> +------------------------------------------------------------------+
> |                                   Mail from           Python     |
> |                                     Mom                Lists     |
> |     Nigerian    Snakeoil                                         |
> |      Scams                                                       |
> |                                                                  |
> |             + Spam Centroid                                      |
> |     Casinos                                                      |
> |                              School                              |
> |               Credit         Alumni          + Good Mail         |
> |                                                Centroid          |
> |                                                                  |
> |                                                                  |
> |                               Brothers                           |
> |                                                                  |
> |                                                                  |
> |                                                  DBMS Discussion |
> +------------------------------------------------------------------+
>
> (I'm pretending it makes sense to project this onto two dimensions.
> In a sense, there's a dimension for each word that is considered, so
> that if there are 30000 words in your dictionary, there's a _PILE_ of
> dimensions!)
>
> If everything gets "averaged," then what you have are two categories,
> "good" and "bad," and whether something's "good" or "bad" depends on
> which of the two centroids its value lies closer to.  (Both centroids
> are labeled in the diagram.)
>
> If you have a whole whack of categories, it means you're no longer
> looking at nearness to merely two "centroids," but rather looking for
> the nearest of many.  Note that the "cloud" around the 'Good Mail
> Centroid' is rather large.  In fact, in this diagram, mail from
> schoolmates may wind up looking as if it should be categorized as
> spam.
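
The nearest-centroid picture can be sketched in a few lines of Python. This is
only the geometric idea from the diagram, not Graham's actual probability
combination; the function names, the word-frequency vectors, and the use of
Euclidean distance are assumptions chosen to keep the example small.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Map a message to relative word frequencies (one dimension per word)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def centroid(vectors):
    """Average a list of sparse vectors component by component."""
    acc = defaultdict(float)
    for vec in vectors:
        for word, value in vec.items():
            acc[word] += value
    return {word: value / len(vectors) for word, value in acc.items()}


def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2
                         for w in set(a) | set(b)))


def nearest(message, centroids):
    """Name of the category whose centroid lies closest to the message."""
    vec = vectorize(message)
    return min(centroids, key=lambda name: distance(vec, centroids[name]))


# Made-up training messages, one small cloud per category.
training = {
    "nigerian scam": ["urgent transfer of funds from nigeria bank",
                      "confidential funds transfer nigeria minister"],
    "casino": ["win big at online casino tonight",
               "casino bonus win free chips"],
    "python list": ["python list comprehension question",
                    "python generator question"],
    "family": ["dinner on sunday love mom",
               "mom says call your brother sunday"],
}
centroids = {name: centroid([vectorize(m) for m in msgs])
             for name, msgs in training.items()}
```

The two-category filter is the same nearest() call with only two entries in
the dictionary, each averaged over everything labeled good or bad.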
>
> I arbitrarily chose that; the point is that the simple "good versus
> bad" is something of an oversimplification.  You've got a lot of
> statistics, and you're not using them all.
>
> I would _definitely_ argue that having several spam folders to choose
> from should be helpful, as it allows taking advantage of the fact that
> (for instance) African Financial Scams have _really_ similar
> characteristics, and you can be _really_ confident that you've got a
> Nigerian Pyramid Scam.  That gives _greater_ certainty of appropriate
> message classifications.
> --
> (reverse (concatenate 'string "gro.mca@" "enworbbc"))
> http://www.ntlug.org/~cbbrowne/sgml.html
> "The Amiga  is proof that  if you build  a better mousetrap,  the rats
> will gang up on you."  -- Bill Roberts bill.roberts at ensco.com