Graham's spam filter
cbbrowne at acm.org
Thu Sep 5 17:33:24 CEST 2002
Centuries ago, Nostradamus foresaw when Erik Max Francis <max at alcyone.com> would write:
> Aaron Swartz wrote:
>> I've been using bogofilter, Eric Raymond's Graham-derived spam
>> filter which threw away base64-encoded data and 90% of all spam that
>> got past the filter was base64-encoded. Therefore, I think that base64
>> content really needs to be decoded. I wrote a base64-decoding filter
>> in Python for it and the problem has gone away.
> Indeed. I've been finding very much the same thing with my rule-based
> filter; about 90% of the spam that's getting through is base64 encoded.
> I haven't yet taken the next step of automatically decoding the base64
> text parts (and then just processing that), but as you have discovered
> it is an obvious solution to the obvious problem.
Have you considered simply replacing strings that appear to be
base64-encoded with a token like "base64-text"?
That allows the database to at least be aware that the spam commonly
contains base64 data.
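
A minimal sketch of that replacement in Python. The 40-character threshold, the regular expression, and the name mask_base64 are assumptions made for illustration; they are not part of bogofilter or any other existing filter.

```python
import re

# Assumption: a "base64-looking" line is 40+ characters drawn only from the
# base64 alphabet, optionally ending in '=' padding. A short final line of an
# encoded block will not match and is left as an ordinary line.
BASE64_LINE = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")
PLACEHOLDER = "base64-text"  # the token suggested above


def mask_base64(body):
    """Collapse each run of base64-looking lines into one placeholder line."""
    out = []
    in_run = False
    for line in body.splitlines():
        if BASE64_LINE.fullmatch(line.strip()):
            if not in_run:
                out.append(PLACEHOLDER)
            in_run = True
        else:
            out.append(line)
            in_run = False
    return "\n".join(out)
```

The masked body can then go through whatever tokenizer the filter already uses, so "base64-text" gets scored like any other word.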
-> Supposing there is interesting text encoded (such as source code
   for a virus) inside the base64 stuff, it _would_ be useful to
   decode it.
-> Supposing the base64 stuff is basically just a GIF/JPEG/PNG, or
   something else that doesn't contain "interesting text," you won't
   get much of value from the decoding process.
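
The two cases above could be told apart by MIME type. Here is a hedged sketch using the standard library's email package: text parts are decoded, while other base64 parts are reduced to a marker. The function name tokenizable_text and the "base64-<type>" marker format are hypothetical.

```python
import email


def tokenizable_text(raw_message):
    """Return decoded text parts plus a marker for each non-text base64 part."""
    msg = email.message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        encoding = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
        if part.get_content_maintype() == "text":
            # decode=True undoes base64 or quoted-printable and returns bytes
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "latin-1"
            try:
                chunks.append(payload.decode(charset, errors="replace"))
            except LookupError:  # charset name Python does not know
                chunks.append(payload.decode("latin-1"))
        elif encoding == "base64":
            # Hypothetical marker, e.g. "base64-image" for an attached GIF
            chunks.append("base64-" + part.get_content_maintype())
    return "\n".join(chunks)
```

This way the filter sees the words in an encoded text part, and an image costs only a single token.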
Making the "tokenizing" step a tad smarter (e.g., recognizing "this
is likely base64" and collecting stats on the number of lines of base64
material) requires minimal added effort, and I expect it would buy you
_most_ of the benefits of decoding.
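
A rough sketch of such a tokenizer, assuming a simple line-based heuristic. The regular expressions, the bucket boundaries, and the "base64-lines:" token format are all illustrative choices, not taken from any existing filter.

```python
import re

# Assumption: a line of 40+ base64-alphabet characters is "likely base64".
BASE64_LINE = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")
WORD = re.compile(r"[A-Za-z0-9$'-]+")


def bucket(n):
    """Coarse size class, so similar line counts share one token."""
    if n < 10:
        return "1-9"
    if n < 100:
        return "10-99"
    return "100+"


def tokenize(body):
    """Split ordinary lines into words; summarize base64 lines as two tokens."""
    tokens = []
    base64_lines = 0
    for line in body.splitlines():
        if BASE64_LINE.fullmatch(line.strip()):
            base64_lines += 1
        else:
            tokens.extend(w.lower() for w in WORD.findall(line))
    if base64_lines:
        tokens.append("base64-text")
        tokens.append("base64-lines:" + bucket(base64_lines))
    return tokens
```

Bucketing the line count keeps the token vocabulary small; feeding raw counts to the database would give nearly every message its own token.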
(concatenate 'string "chris" "@cbbrowne.com")
Let me control a planet's oxygen supply and I don't care who makes the