FW: [spambayes-dev] Results for DNS lookup in tokenizer

Sun Apr 11 12:55:05 EDT 2004

At 08:25 AM 4/11/04 -0400, spambayes-dev-request at python.org wrote:
>I'll restate my question.  What does Matt's proposal do that
>mine_received_headers doesn't do already?

It looks at URLs embedded in the message *body*.  As a simple contrast, if 
I link here to:

http://enlarge-my-spam.com?id=123456

That will produce a very *different* set of IP tokens than the Received: 
headers of this message.  And, if the same spam is sent from a thousand 
compromised PCs, they will all still have the same URL IP cues, despite 
lacking any Received: headers in common.  Yes, they'll also have tokens 
representing parts of the domain name, but spammers can cheaply change 
their domain names to avoid being recognized.

Their website IP addresses are not only harder to change; tokenizing them 
also takes advantage of the fact that so-called "bulletproof hosting" 
providers are a "bad neighborhood" for links.  So, if you train on these 
tokens, then you could potentially nail entirely unrelated spammers who 
simply host with the same ISP.
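
To make that concrete, here is a rough sketch of the idea in Python.  It is 
not the actual patch or the spambayes tokenizer API: the function names, the 
"url-ip:" token format, and the choice to emit one token per address prefix 
are all assumptions of mine for illustration.  Emitting the shorter prefixes 
is one way the "same ISP" effect could show up in training, since two hosts 
in the same address block would share those tokens.

```python
import re
import socket
from urllib.parse import urlsplit

# Deliberately loose pattern for http/https URLs in a message body.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def resolve_host(host):
    """Return the IPv4 addresses for host, or [] if the lookup fails."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except (OSError, UnicodeError):
        return []


def url_ip_tokens(body, resolver=resolve_host):
    """Yield hypothetical 'url-ip:' tokens for each host linked in body.

    For every resolved address a.b.c.d this yields tokens for the /8,
    /16, /24 and /32 prefixes, so hosts in the same block share tokens.
    """
    seen = set()
    for url in URL_RE.findall(body):
        host = urlsplit(url).hostname
        if not host or host in seen:
            continue
        seen.add(host)
        for ip in resolver(host):
            octets = ip.split(".")
            for n in range(1, 5):
                yield "url-ip:%s/%d" % (".".join(octets[:n]), 8 * n)
```

The resolver is passed in as a parameter so the lookup can be cached, given 
a timeout, or stubbed out when testing without touching the network.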

Of course, the spammers' next move would likely be to use redirects from 
non-"bulletproof" hosts, but everything we can do to make it more difficult 
and more costly for them is a good thing.