OT: spam filtering idea
skip at pobox.com
Tue Jan 14 17:11:58 CET 2003
(I haven't the slightest idea if Paul's email address is valid. Sure looks
Paul> Indeed. However, I am seeing a lot of "minimalist" spam which is
Paul> obviously intended to evade body filtering: usually just a URL and
Paul> a hashbuster. I imagine that they're banking on people being
Paul> curious enough to click the link. I'm planning on dealing with
Paul> short spam like this by looking up the website host IP in
Paul> blacklists, but it's not quite enough of a problem to worry about
Spambayes already looks at URLs. Minimalist URL-containing spam such as you
mention tends to wind up "unsure" until I train on it. A recent case in
point: lots of spam coming from "big at boss.com". Your message had nearly 20
url:* tokens in it according to the Spambayes tokenizer (sorted here from hammy
And it's faster (and probably more accurate) than consulting an off-site
oracle to boot.
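
To make the url:* idea concrete, here is a rough sketch of how URL pieces can be turned into tokens for a Bayesian filter. This is not the actual Spambayes tokenizer: the regex, the function name url_tokens and the exact splitting rules are assumptions made up for illustration.

```python
import re

# Hypothetical, simplified URL tokenizer. NOT the real Spambayes code;
# the pattern and the splitting rules below are illustrative assumptions.
URL_RE = re.compile(
    r"(https?|ftp)://([^\s<>\"'/]+)(/[^\s<>\"']*)?",
    re.IGNORECASE,
)

def url_tokens(text):
    """Return a list of 'url:'-prefixed tokens for every URL in text."""
    tokens = []
    for scheme, host, path in URL_RE.findall(text):
        # The scheme itself is a (weak) clue.
        tokens.append("url:" + scheme.lower())
        # Each dot-separated piece of the host becomes its own token.
        for piece in host.lower().split("."):
            if piece:
                tokens.append("url:" + piece)
        # Path components are split on anything non-alphanumeric.
        for piece in re.split(r"[^A-Za-z0-9]+", path):
            if piece:
                tokens.append("url:" + piece.lower())
    return tokens

if __name__ == "__main__":
    print(url_tokens("Click http://www.Example.com/offers/buy.html now"))
```

Because the tokens are ordinary clues, a spammer's host name picks up a spam probability after a few training messages, with no network lookup at classification time.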
I've been using Spambayes since before November 1 (the date of my oldest
.procmailrc backup file). I see no false positives and a modest number of
unsures and false negatives. That is much better than what I had
pre-Spambayes (which was SpamAssassin).
After the initial big training run (lots of both ham and spam), I have only
been training on unsure or incorrectly classified messages.
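
The "train only on unsure or misclassified messages" regime can be sketched like this. The function names and the two cutoff values are assumptions for illustration, not something taken from my setup or from the Spambayes source.

```python
# Sketch of a train-on-unsure-or-wrong policy. The cutoffs and the
# function names are hypothetical; adjust them to your own filter.
HAM_CUTOFF = 0.20   # assumed: scores below this are called ham
SPAM_CUTOFF = 0.90  # assumed: scores at or above this are called spam

def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam probability in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

def needs_training(score, true_label):
    """True if the message was unsure or misclassified.

    true_label is 'ham' or 'spam' as decided by a human, so an
    'unsure' verdict never matches it and always triggers training.
    """
    return classify(score) != true_label

if __name__ == "__main__":
    for score, label in [(0.05, "ham"), (0.50, "spam"), (0.95, "ham")]:
        print(score, label, needs_training(score, label))
```

Correctly classified messages are skipped, so the training database only grows when the filter was unsure or wrong.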