OT: spam filtering idea
Skip Montanaro
skip at pobox.com
Tue Jan 14 11:11:58 EST 2003
(I haven't the slightest idea if Paul's email address is valid. Sure looks
weird.)
Paul> Indeed. However, I am seeing a lot of "minimalist" spam which is
Paul> obviously intended to evade body filtering: usually just a URL and
Paul> a hashbuster. I imagine that they're banking on people being
Paul> curious enough to click the link. I'm planning on dealing with
Paul> short spam like this by looking up the website host IP in
Paul> blacklists, but it's not quite enough of a problem to worry about
Paul> yet.
Spambayes already looks at URLs. Minimalist URL-containing spam such as you
mention tends to wind up "unsure" until I train on it. A recent case in
point: lots of spam coming from "big at boss.com". Your message had nearly 20
url:* tokens in it according to the Spambayes tokenizer (sorted here from
hammy to spammy):
'url:python-list': 0.01
'url:selm': 0.01
'url:listinfo': 0.02
'url:mailman': 0.02
'url:python': 0.02
'url:demon': 0.05
'url:org': 0.06
'url:groups': 0.09
'url:mail': 0.09
'url:google': 0.14
'url': 0.35
'url:com': 0.63
'url:html': 0.68
'url:www': 0.69
'url:co': 0.85
'url:pobox': 0.90
'url:18': 0.92
'url:11': 0.94
And it's faster (and probably more accurate) than consulting an off-site
oracle to boot.
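
For anyone curious how a URL can turn into a pile of url:* tokens like the
ones listed above, here is a rough sketch. It is not the actual Spambayes
tokenizer; the function name url_tokens, the regular expression, and the
splitting rule are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: grab everything after the scheme up to whitespace
# or a quote/angle bracket. The real Spambayes tokenizer may differ.
URL_RE = re.compile(r"https?://([^\s<>\"']+)", re.IGNORECASE)

# Split on anything that is not a letter, digit, underscore, or hyphen,
# so "python-list" survives as one piece.
SPLIT_RE = re.compile(r"[^A-Za-z0-9_-]+")


def url_tokens(text):
    """Return the set of 'url:<piece>' tokens found in text."""
    tokens = set()
    for match in URL_RE.finditer(text):
        for piece in SPLIT_RE.split(match.group(1).lower()):
            if piece:
                tokens.add("url:" + piece)
    return tokens


if __name__ == "__main__":
    sample = "See http://mail.python.org/mailman/listinfo/python-list"
    for tok in sorted(url_tokens(sample)):
        print(tok)
```

Each piece then gets its own spam probability from training, which is why a
message consisting of little more than a URL still gives the classifier
something to work with.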
I've been using Spambayes since before November 1 (my oldest .procmailrc
backup file). I see no false positives and a modest number of unsures and
false negatives. Much better than pre-Spambayes (which was SpamAssassin).
After the initial big training run (lots of both ham and spam), I have only
been training on unsure or incorrectly classified messages.
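
The train-on-unsure-or-wrong regime described above can be sketched roughly
as follows. The ToyClassifier here is a deliberately crude stand-in and not
the Spambayes scoring algorithm; the class, the function names, and the
cutoff values are assumptions made for illustration only.

```python
from collections import Counter

# Assumed cutoffs for illustration; tune to taste.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


class ToyClassifier:
    """Crude token-count classifier; NOT the Spambayes algorithm."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Average the per-token spam ratios; unseen tokens are ignored.
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            if s + h:
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def verdict(score):
    if score < HAM_CUTOFF:
        return "ham"
    if score > SPAM_CUTOFF:
        return "spam"
    return "unsure"


def train_on_error(clf, tokens, is_spam):
    """Train only when the message was unsure or misclassified.

    Returns True if the classifier was trained on this message.
    """
    expected = "spam" if is_spam else "ham"
    if verdict(clf.score(tokens)) != expected:
        clf.train(tokens, is_spam)
        return True
    return False
```

With this sketch, a message that already scores confidently on the correct
side of the cutoffs is left alone, so the training set grows only with the
messages the classifier actually struggled with.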
Skip