Graham's spam filter

Fernando Pereira pereira at cis.upenn.edu
Thu Sep 5 22:59:50 EDT 2002


On 9/5/02 6:10 PM, in article 3D77D657.DE280373 at alcyone.com, "Erik Max
Francis" <max at alcyone.com> wrote:

> Christopher Browne wrote:
> 
>> Quoth Erik Max Francis <max at alcyone.com>:
>> 
>>> Spammers are hitting upon the strategy, though, of sending emails in
>>> which the body consists of nothing but a completely encoded base64
>>> MIME part.  So in that case, the entire body of your message would
>>> consist solely of your "base64encoded" token.  So in the general
>>> case of any kind of spam filter (not just limited to a Graham
>                                    ^
>>> filter), it's questionable how useful this will be, unless you plan
>>> to always filter against that token, presuming it to always indicate
>>> spam.
>> 
>> I've been using naive Bayesian filtering for years; I don't assume
>> that _any_ particular token indicates _any_ particular result.
> 
> I thought I made it clear that I was discussing spam filters in general,
> not just Graham/Bayesian filters.

That is, you overstated the case. Your point may be true of rule-based
filters that look for particular words presumed to be spam indicators. But a
machine-learning approach using a rich set of message features will pick up
on *anything* that is correlated with spamminess, not just the obvious
features. In an arms race, the spammer may gain a temporary advantage by
changing the message format, and thus which features are significant, but
after receiving a few new spams, retraining will pick up on other
appropriate features, for instance the fraction of the message bytes that
are base-64 encoded. Occasionally, filter designers may want to add new
features based on some exploratory analysis of spam, but the decision of
which of those are relevant is left to the learning algorithm.
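To make the retraining argument concrete, here is a minimal sketch of a
naive Bayes filter that treats the base-64 fraction as just one more
feature. It is an illustration under stated assumptions, not Graham's
formula or any particular filter's code: the names make_raw,
base64_fraction, features and NaiveBayesFilter, the 0.5 bucket threshold,
and the add-one smoothing are all made up for this example.

```python
import base64
import math
import re
from collections import Counter
from email import message_from_string


def make_raw(body, encode=False):
    """Hypothetical helper: build a tiny one-part message for experiments."""
    if encode:
        payload = base64.b64encode(body.encode()).decode()
        return "Content-Transfer-Encoding: base64\n\n" + payload + "\n"
    return "Content-Transfer-Encoding: 7bit\n\n" + body + "\n"


def is_base64(part):
    cte = part.get("Content-Transfer-Encoding", "")
    return cte.strip().lower() == "base64"


def base64_fraction(raw_message):
    """Fraction of payload characters that sit in base64-encoded parts."""
    total = encoded = 0
    for part in message_from_string(raw_message).walk():
        if part.is_multipart():
            continue
        size = len(part.get_payload())  # payload as sent, not decoded
        total += size
        if is_base64(part):
            encoded += size
    return encoded / total if total else 0.0


def features(raw_message):
    """Word tokens from unencoded parts, plus a bucketed base64 feature."""
    feats = set()
    for part in message_from_string(raw_message).walk():
        if part.is_multipart() or is_base64(part):
            continue
        feats.update(re.findall(r"[a-z']+", part.get_payload().lower()))
    # Assumed threshold: the learner, not the designer, decides whether
    # this feature actually matters.
    high = base64_fraction(raw_message) > 0.5
    feats.add("b64frac:high" if high else "b64frac:low")
    return feats


class NaiveBayesFilter:
    """Plain naive Bayes over present features, with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, raw_message, label):
        self.counts[label].update(features(raw_message))
        self.totals[label] += 1

    def spam_log_odds(self, raw_message):
        """Positive means 'looks more like spam than ham'."""
        n_spam, n_ham = self.totals["spam"], self.totals["ham"]
        score = math.log((n_spam + 1) / (n_ham + 1))
        for f in features(raw_message):
            p_spam = (self.counts["spam"][f] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][f] + 1) / (n_ham + 2)
            score += math.log(p_spam / p_ham)
        return score
```

Train it on plain-text ham and spam, and a fully encoded spam gives it
nothing to go on except b64frac:high, which it has never seen. Call
train(..., "spam") on a couple of such messages and that one feature starts
carrying the weight, with no rule written by hand.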

-- F




More information about the Python-list mailing list