OT: spam filtering idea

Skip Montanaro skip at pobox.com
Tue Jan 14 17:11:58 CET 2003


(I haven't the slightest idea if Paul's email address is valid.  Sure looks
weird.) 

    Paul> Indeed. However, I am seeing a lot of "minimalist" spam which is
    Paul> obviously intended to evade body filtering: usually just a URL and
    Paul> a hashbuster. I imagine that they're banking on people being
    Paul> curious enough to click the link. I'm planning on dealing with
    Paul> short spam like this by looking up the website host IP in
    Paul> blacklists, but it's not quite enough of a problem to worry about
    Paul> yet.
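
For reference, a blacklist lookup of the kind Paul describes is usually done as a DNS query: the IPv4 octets are reversed and a blacklist zone is appended. The sketch below only builds the query name. The zone `dnsbl.example.org` and the helper name `dnsbl_query_name` are placeholders for illustration, not a real service or an existing API.

```python
import ipaddress


def dnsbl_query_name(ip, zone="dnsbl.example.org"):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone.

    The zone default is a hypothetical placeholder, not a real blacklist.
    """
    # Validate the address; raises ValueError on malformed input.
    addr = ipaddress.IPv4Address(ip)
    # DNSBLs are queried with the octets in reverse order.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return reversed_octets + "." + zone


print(dnsbl_query_name("192.0.2.10"))
```

If the resulting name resolves to an address, the host is listed; a lookup failure means it is not. That extra network round trip is the cost of this approach.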

Spambayes already looks at URLs.  Minimalist URL-containing spam such as you
mention tends to wind up "unsure" until I train on it.  A recent case in
point: lots of spam coming from "big at boss.com".  Your message had nearly 20
url:* tokens in it according to the Spambayes tokenizer (sorted here from hammy
to spammy):

    'url:python-list': 0.01
    'url:selm': 0.01
    'url:listinfo': 0.02
    'url:mailman': 0.02
    'url:python': 0.02
    'url:demon': 0.05
    'url:org': 0.06
    'url:groups': 0.09
    'url:mail': 0.09
    'url:google': 0.14
    'url': 0.35
    'url:com': 0.63
    'url:html': 0.68
    'url:www': 0.69
    'url:co': 0.85
    'url:pobox': 0.90
    'url:18': 0.92
    'url:11': 0.94

Scoring URL tokens locally is also faster (and probably more accurate) than
consulting an off-site oracle.
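
To give a rough idea of where url:* tokens like those listed above come from, here is a simplified sketch. It is not the actual Spambayes tokenizer; the function name `url_tokens` and the splitting rules are my own assumptions, chosen only to produce tokens of a similar shape.

```python
import re

# Assumed, simplified patterns; the real tokenizer is more elaborate.
URL_RE = re.compile(r"(?:https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)
SPLIT_RE = re.compile(r"[/\\:.?&=;@%#]+")


def url_tokens(text):
    """Return 'url:<piece>' tokens for every URL found in text."""
    tokens = []
    for match in URL_RE.finditer(text):
        # Break the part after the scheme into host and path pieces.
        for piece in SPLIT_RE.split(match.group(1)):
            if piece:
                tokens.append("url:" + piece.lower())
    return tokens


print(url_tokens("http://mail.python.org/mailman/listinfo/python-list"))
```

Each piece then gets its own spam probability from training, so a spammer's host name can become a strong clue after only a message or two.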

I've been using Spambayes since before November 1 (the date of my oldest
.procmailrc backup file).  I see no false positives and a modest number of
unsures and false negatives.  That's much better than my pre-Spambayes setup
(SpamAssassin).  After the initial big training run (lots of both ham and
spam), I have only been training on unsure or incorrectly classified messages.
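
The training regime above can be sketched roughly as follows. The cutoff values and the function names are assumptions for illustration; adjust them to whatever your own setup uses.

```python
# Assumed thresholds on the 0.0-1.0 spam score; not taken from any config.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def classify(score):
    """Map a spam score to 'ham', 'unsure' or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def needs_training(score, actual):
    """True if the message was unsure or misclassified.

    `actual` is the human verdict: 'ham' or 'spam'.
    """
    # 'unsure' never equals a human verdict, so unsures always get trained.
    return classify(score) != actual


print(needs_training(0.55, "spam"))
```

Messages the classifier already gets right are left alone, which keeps the training database small.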

Skip




