Graham's spam filter

Erik Max Francis max at
Thu Sep 5 21:11:49 CEST 2002

Christopher Browne wrote:

> Have you considered simply replacing strings that appear to be
> base64-encoded with a token like "base64-text"?
> That allows the database to at least be aware that the spam commonly
> contains base64 data.

Well, that really depends on what your goal is.  Again, if you're one of
those people who have a very tight circle of email buddies, so that
essentially any unsolicited email is by definition spam, then you can
tighten down your spam filter in all kinds of very powerful ways.

I, as I've mentioned before, receive unsolicited email from my Web sites
and various projects, and so unfortunately don't have the luxury of
doing this.  So I need to support receiving email from faraway lands and
unknown email addresses as well as trying to vigorously filter spam.

Fact is, unfortunately, lots of people send legitimate email that is
MIME encoded.

> -> Supposing there is interesting text encoded (such as source code
>    for a virus) inside the base64 stuff, it _would_ be useful to
>    decode it;
> -> Supposing the base64 stuff is basically just a GIF/JPEG/PNG, or
>    something else that doesn't contain "interesting text," you'll
>    have not much of value from the decoding process.
> Making the "tokenizing" step a tad smarter (e.g. - recognizing "this
> is likely base 64" and collecting stats on numbers of lines of base64
> material) requires minimal added effort, and I expect it would buy you
> _most_ of the benefits of decoding.
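
For concreteness, the quoted suggestion might look roughly like the
sketch below. It is not code from either poster: the function name
tokenize, the 40-character threshold, and the word pattern are all
assumptions made for illustration, and the base64 test is only a
heuristic (a long unbroken alphanumeric string would fool it).

```python
import re

# Heuristic (an assumption, not from the thread): a line counts as base64
# if it is 40+ characters drawn only from the base64 alphabet, optionally
# followed by '=' padding.
BASE64_LINE = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")

# Hypothetical word pattern for ordinary text lines.
WORD = re.compile(r"[A-Za-z0-9'$-]+")


def tokenize(body):
    """Return (tokens, base64_line_count) for a message body.

    Each run of consecutive base64-looking lines is collapsed into a
    single "base64-text" token; the number of such lines is counted
    separately so it can be used as a statistic.
    """
    tokens = []
    base64_lines = 0
    in_run = False
    for line in body.splitlines():
        if BASE64_LINE.fullmatch(line.strip()):
            base64_lines += 1
            if not in_run:
                tokens.append("base64-text")
                in_run = True
        else:
            in_run = False
            tokens.extend(WORD.findall(line))
    return tokens, base64_lines
```

Called on a body with two ordinary lines around two base64 lines, this
yields the ordinary words, one "base64-text" token, and a line count
of 2.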

Spammers are hitting upon the strategy, though, of sending emails in
which the body consists of nothing but a single, completely
base64-encoded MIME part.  In that case, the entire body of your message
would be reduced to your "base64-text" token and nothing else.  So for
any kind of spam filter (not just a Graham filter), it's questionable
how useful this will be, unless you plan to always filter against that
token, presuming it to always indicate spam.
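
The alternative to a placeholder token is to decode the part before
tokenizing it. A minimal sketch using the standard library's email
package follows; the function name body_text is hypothetical, only
text/* parts are looked at, and the charset fallback is an assumption
rather than anything prescribed in this thread.

```python
from email import message_from_string


def body_text(raw_message):
    """Return the decoded text of every text/* part in a raw message.

    get_payload(decode=True) undoes the Content-Transfer-Encoding
    (base64 or quoted-printable), so a body that is nothing but one
    base64 part still produces real words for the filter to score.
    """
    msg = message_from_string(raw_message)
    pieces = []
    for part in msg.walk():
        # Multipart containers and images/attachments are skipped.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            pieces.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to ASCII with replacement.
            pieces.append(payload.decode("ascii", errors="replace"))
    return "\n".join(pieces)
```

The decoded text can then go through whatever tokenizer the filter
already uses, so legitimate and spam messages that happen to be
base64-encoded are judged on their content rather than on the encoding.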

 Erik Max Francis / max at /
 __ San Jose, CA, US / 37 20 N 121 53 W / ICQ16063900 / &tSftDotIotE
/  \ There is nothing so subject to the inconstancy of fortune as war.
\__/ Miguel de Cervantes
    Church /
 A lambda calculus explorer in Python.
