Graham's spam filter

Thu Aug 22 23:06:39 EDT 2002

Christopher Browne <cbbrowne at acm.org> writes:
> > You can't skip base64-encoded stuff since a lot of it is spam.  You
> > have to decode it and filter it.
> 
> Ah, but the fact that there's a chunk of base64-encoded material is a
> piece of data.  Create a 'base64' element, and count it.  Works like a
> charm.  (Throw it away, and you're left with little more than header
> data, which is also Statistically Highly Significant, which _also_
> works like a charm.)
> 
> There's lots about this that _isn't_ intuitively obvious unless you
> think very carefully about the math...

I don't understand this.  If you could classify spam based on just the
headers, there'd be no point in filtering the content, so we wouldn't
be talking about text corpora.  You have to filter on content as well.

And if you're going to filter content, you have to realize some
messages will be base64-encoded, and of those base64 messages, some
will be spam and others will be non-spam.  The idea of a spam filter
is to figure out which are which.  It can't do that without decoding
and examining them.
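
To make the decoding step concrete, here is a minimal sketch in Python
using the standard email package.  It is not Graham's filter or anyone
else's actual implementation; the function name body_tokens, the
"cte:base64" marker token, and the tokenizing regex are all assumptions
made up for illustration.  It decodes each text part (base64 or
otherwise) before tokenizing, and also records the fact that a part was
base64-encoded as a token of its own, so a statistical filter could use
both signals.

```python
import base64
import email
import re

def body_tokens(raw_message):
    """Return word tokens from every text part of a message.

    Hypothetical helper: transfer encodings such as base64 are undone
    before tokenizing, and a 'cte:base64' marker token is added so the
    encoding itself can be counted as evidence.
    """
    msg = email.message_from_string(raw_message)
    tokens = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_maintype() != "text":
            continue
        cte = part.get("Content-Transfer-Encoding", "").strip().lower()
        if cte == "base64":
            tokens.append("cte:base64")  # assumed marker token
        payload = part.get_payload(decode=True)  # bytes, encoding undone
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to a lossless byte mapping.
            text = payload.decode("latin-1")
        tokens.extend(re.findall(r"[a-z0-9$'-]+", text.lower()))
    return tokens

# A made-up base64-encoded message for demonstration.
sample = (
    "From: a@example.com\n"
    "Subject: hi\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(b"Buy cheap pills now").decode("ascii")
    + "\n"
)

print(body_tokens(sample))
# ['cte:base64', 'buy', 'cheap', 'pills', 'now']
```

Without the get_payload(decode=True) step, the only body tokens
available would be fragments of the base64 alphabet, which say nothing
about whether the text underneath is spam.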