Graham's spam filter

Christopher Browne cbbrowne at acm.org
Thu Sep 5 22:24:18 CEST 2002


Quoth Erik Max Francis <max at alcyone.com>:
> Christopher Browne wrote:
>> Have you considered simply replacing strings that appear to be
>> base64-encoded with a token like "base64-text"?
>> 
>> That allows the database to at least be aware that the spam
>> commonly contains base64 data.
>
> Well, that really depends on what your goal is.  Again, if you're
> one of those people that has a very tight circle of email buddies
> and so essentially any unsolicited email is by definition spam, then
> you can tighten down your spam filter in all kinds of very powerful
> ways.
>
> I, as I've mentioned before, receive unsolicited email from my Web
> sites and various projects, and so unfortunately don't have the
> luxury of doing this.  So I need to support receiving email from
> faraway lands and unknown email addresses as well as trying to
> vigorously filter spam.
>
> Fact is, unfortunately, lots of people send legitimate email that is
> MIME encoded.

No, I am certainly _not_ defining "all unsolicited email" as being
spam.  Quite to the contrary, I receive quite a lot of interesting
email from unexpected sources.  Very little of it, statistically
speaking, is heavily MIME encoded, mind you...

The MIME encoded stuff does _not_ solely consist of "base64" text; it
also has header information that at least _suggests_ file type info.

A recent virus email contained:

Content-Type: application/octet-stream;
        name=snoopy.exe
Content-Transfer-Encoding: base64
Content-ID: <UCuk0QbULj2h8t9F>

another had:

Content-Type: audio/x-wav;
        name=bgcolor.exe
Content-Transfer-Encoding: base64
Content-ID: <W0H1pml7>

I get legitimate mail that contains base64 material; it _never_, in my
experience, consists solely of base64 material.

It always contains _some_ sort of commentary, and whether that
commentary came as text or as HTML, it's quite nicely sufficient to
distinguish the "unexpected resumes from Russia" from the email
viruses.
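
As a rough illustration of the header point, the part headers quoted above can be turned into tokens using Python's standard email package. The function name and the token spellings below are invented for this sketch; nothing here is taken from an existing filter.

```python
import email
import os

def header_tokens(part_text):
    """Turn one MIME part's headers into tokens for a statistical filter.

    The token spellings are hypothetical; any consistent scheme would do.
    """
    part = email.message_from_string(part_text)
    # get_content_type() returns the lowercased type/subtype pair.
    tokens = ["content-type:" + part.get_content_type()]
    encoding = part.get("Content-Transfer-Encoding")
    if encoding:
        tokens.append("content-transfer-encoding:" + encoding.strip().lower())
    # get_filename() falls back to the name= parameter of Content-Type.
    name = part.get_filename()
    if name:
        ext = os.path.splitext(name)[1].lower()
        if ext:
            tokens.append("name-extension:" + ext)
    return tokens

# The second virus example from the message, with a short made-up payload.
sample = (
    "Content-Type: audio/x-wav;\n"
    "        name=bgcolor.exe\n"
    "Content-Transfer-Encoding: base64\n"
    "Content-ID: <W0H1pml7>\n"
    "\n"
    "TVqQAAMAAAAEAAAA\n"
)
print(header_tokens(sample))
```

A statistical filter fed these tokens can learn for itself whether an "audio/x-wav" part named "*.exe" is suspicious; no rule has to say so.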

>> -> Supposing there is interesting text encoded (such as source code
>>    for a virus) inside the base64 stuff, it _would_ be useful to
>>    decode it;
>> 
>> -> Supposing the base64 stuff is basically just a GIF/JPEG/PNG, or
>>    something else that doesn't contain "interesting text," you'll
>>    have not much of value from the decoding process.
>> 
>> Making the "tokenizing" step a tad smarter (e.g. - recognizing "this
>> is likely base 64" and collecting stats on numbers of lines of base64
>> material) requires minimal added effort, and I expect it would buy you
>> _most_ of the benefits of decoding.
>
> Spammers are hitting upon the strategy, though, of sending emails in
> which the body consists of nothing but a completely encoded base64
> MIME part.  So in that case, the entire body of your message would
> consist solely of your "base64encoded" token.  So in the general
> case of any kind of spam filter (not just limited to a Graham
> filter), it's questionable how useful this will be, unless you plan
> to always filter against that token, presuming it to always indicate
> spam.

I've been using naive Bayesian filtering for years; I don't assume
that _any_ particular token indicates _any_ particular result.

I'm not interested in the "rule-based" stuff, only in the schemes
based on statistical analysis.

And the body of the message would most certainly NOT consist solely of
a "base64encoded" token.

The body portion would consist of:

- Various "Content-foo" tokens
- The header information that these documents _do_ contain; they
  normally contain an HTML header.
- Not "solely a base64encoded token," but rather some sort of count
  involving _many_ base64encoded tokens.

The notion that it's "solely one token" is in your imagination, not in
reality.  There are _no_ "presumptions" being made here.

What I'm saying, that apparently isn't being read, is that I expect
that collecting stats on the numbers of "base64 lines" is likely to be
_nearly_ as useful as decoding the contents, and that it's _certainly_
simpler and faster.
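
A minimal sketch of such a tokenizer in Python follows. The 40-character threshold, the power-of-two bucketing, and the token spelling are all assumptions chosen for illustration, not tuned values.

```python
import re

# A line "looks like" base64 if it is long and uses only the base64
# alphabet, optionally ending in padding. The 40-character minimum is an
# arbitrary guess.
BASE64_LINE = re.compile(r"^[A-Za-z0-9+/]{40,}={0,2}$")

def tokenize(body):
    """Split a message body into word tokens, counting base64-looking
    lines instead of decoding them."""
    tokens = []
    base64_lines = 0
    for line in body.splitlines():
        stripped = line.strip()
        if BASE64_LINE.match(stripped):
            base64_lines += 1
        else:
            tokens.extend(stripped.split())
    if base64_lines:
        # Bucket the count so that 900 and 1000 lines yield the same token:
        # a count n always satisfies n < 2 ** n.bit_length().
        tokens.append("base64-lines<2^%d" % base64_lines.bit_length())
    return tokens

body = "Hello there\n" + ("A" * 60 + "\n") * 3
print(tokenize(body))
```

Bucketing keeps the number of distinct tokens small, so the filter's database sees "a few lines" versus "hundreds of lines" rather than one token per exact count.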

If it _proves_ insufficient as a discriminator (please feel free to
direct any nonsense about 'presuming anything to always indicate
spam' to /dev/null), then it might prove necessary to _try_ to analyze
the contents.

_Trying_ to decode and analyze the contents may still prove a futile
exercise.  You won't get much useful material out of such common MIME
contents as graphics, PDFs, ZIP files, and audio files without going
to even _more_ gratuitous lengths to analyze them, lengths that might
very well make you vulnerable to DoS attacks directed against the mail
filter itself.
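
If decoding were attempted anyway, one way to limit the exposure is to cap how much encoded text is ever decoded per part. The sketch below assumes a 64 KiB cap, which is an arbitrary figure, and the function name is hypothetical.

```python
import base64
import binascii

MAX_ENCODED_CHARS = 64 * 1024  # arbitrary cap, an assumption of this sketch

def bounded_decode(encoded, limit=MAX_ENCODED_CHARS):
    """Decode at most `limit` characters of base64 text.

    Returns (decoded_bytes, truncated). Undecodable input yields b"".
    """
    truncated = len(encoded) > limit
    # Look only at the first `limit` characters, then drop line breaks.
    compact = "".join(encoded[:limit].split())
    # Keep a multiple of four characters so a cut-off prefix still decodes.
    compact = compact[: len(compact) - len(compact) % 4]
    try:
        return base64.b64decode(compact), truncated
    except (binascii.Error, ValueError):
        return b"", truncated

print(bounded_decode("aGVsbG8gd29ybGQ="))
print(bounded_decode("aGVsbG8gd29ybGQ=", limit=8))
```

This only bounds the decoding step; any analysis of the decoded bytes (unpacking ZIP files, parsing PDFs) would need its own limits.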
-- 
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/nonrdbms.html
"I will not send lard through the mail" ^ 100 -- Bart Simpson


