Graham's spam filter

Christopher Browne cbbrowne at
Thu Aug 22 21:53:38 EDT 2002

In an attempt to throw the authorities off his trail, Paul Rubin <phr-n2002b at> transmitted:
> Neale Pickett <neale at> writes:
>> One thing you *should* do, though, is skip base64-encoded stuff.  That
>> will just clutter up your database.

> You can't skip base64-encoded stuff since a lot of it is spam.  You
> have to decode it and filter it.

Ah, but the fact that there's a chunk of base64-encoded material is
itself a piece of data.  Create a 'base64' element, and count it.  Works
like a charm.  (Throw the encoded chunk away, and you're left with
little more than header data, which is also Statistically Highly
Significant, and so _also_ works like a charm.)
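
A rough sketch of the idea in Python, for the curious. Everything named
here is made up for illustration: the tokens() helper, the '*base64*'
marker, and the token pattern (letters, digits, dashes, apostrophes and
dollar signs) are assumptions, not anybody's actual filter. It tags
header words with the header name, and emits one marker token for each
base64-encoded part instead of decoding or tokenizing its contents.

```python
import email
import re

# Assumed token pattern: runs of letters, digits, '$', apostrophes, dashes.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")

# Hypothetical marker token standing in for any base64-encoded part.
BASE64_MARKER = "*base64*"


def tokens(raw_message):
    """Return the list of tokens for one raw message (a str)."""
    msg = email.message_from_string(raw_message)
    found = []

    # Top-level header words, tagged with the (lowercased) header name.
    for name, value in msg.items():
        for word in TOKEN_RE.findall(value):
            found.append(f"{name.lower()}:{word.lower()}")

    # Body parts: count the presence of base64 rather than its contents.
    for part in msg.walk():
        if part.is_multipart():
            continue
        encoding = (part.get("Content-Transfer-Encoding") or "").strip().lower()
        if encoding == "base64":
            found.append(BASE64_MARKER)
        else:
            payload = part.get_payload()
            if isinstance(payload, str):
                found.extend(w.lower() for w in TOKEN_RE.findall(payload))

    return found
```

Feed the result into whatever per-token spam/ham counts you already
keep; the marker then gets its own statistics like any other word.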

There's lots about this that _isn't_ intuitively obvious unless you
think very carefully about the math...
