FW: [spambayes-dev] Results for DNS lookup in tokenizer
sethg at GoodmanAssociates.com
Sun Apr 11 14:33:53 EDT 2004
> From: Phillip J. Eby
> Sent: Sunday, April 11, 2004 11:55 AM
> At 08:25 AM 4/11/04 -0400, spambayes-dev-request at python.org wrote:
> >I'll restate my question. What does Matt's proposal do that
> >mine_received_headers doesn't do already?
> It looks at URLs embedded in the message *body*. ...
That's _exactly_ what I was getting at. mine_received_headers only looks at
headers, which don't contain the IPs of spamvertised sites. Much, if not
most, of today's spam comes direct-to-MX from compromised Windows boxes on
broadband connections with dynamic IPs, from providers that don't limit
customers' use of outgoing port 25 connections.
The theory, if it is worth anything, is that the total size of the IP
address space for "bad-boy" hosting-service web servers is puny compared
with the dynamic IP pools of major providers that do not block outgoing port
25 connections. Having the token database learn the former is feasible,
while having it learn the latter is pretty hopeless.
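
To make the idea concrete, here is a minimal sketch of what "learning the
former" might look like: pull the hostnames out of URLs in the body, resolve
each one, and emit tokens for the address and its wider prefixes. This is not
Matt's patch or anything in the SpamBayes tokenizer; the function name
url_ip_tokens, the "url-ip:" token prefix, and the regex are all made up for
illustration, and it only handles IPv4.

```python
import re
import socket

# Hypothetical pattern: grabs the host part of http/https URLs in the body.
URL_HOST_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)

def url_ip_tokens(body, resolve=socket.gethostbyname):
    """Yield illustrative 'url-ip:' tokens for each distinct URL host in body.

    `resolve` is injectable so the lookup can be stubbed out or cached;
    it must return a dotted-quad string or raise OSError.
    """
    seen = set()
    for host in URL_HOST_RE.findall(body):
        host = host.lower().rstrip(".")
        if not host or host in seen:
            continue
        seen.add(host)
        try:
            ip = resolve(host)
        except OSError:
            # A failed lookup may itself be a clue, so give it a token.
            yield "url-ip:lookup-failed"
            continue
        octets = ip.split(".")
        # Emit the full address plus /24 and /16 prefixes, so the classifier
        # can learn a whole hosting block rather than one address at a time.
        for n in (4, 3, 2):
            yield "url-ip:%s/%d" % (".".join(octets[:n]), n * 8)
```

Emitting the wider prefixes is the part that matters for the theory above: if
the hosting space really is small, a handful of /24 or /16 tokens should
pick up strong spam probabilities fairly quickly.
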
For exactly the same reason, I would guess that the message source IP is
probably better at identifying ham than spam. For this property alone, it
is extremely valuable. My friends' tendency to use an occasional spammy
word is partially offset by the strong ham clues from their outgoing MTA IP
and their personal email address. In terms of detecting spam, the token
database does a great job at detecting repetitive spam sources, but is
somewhat ill-suited for the dynamic IP phenomenon. Rather than have the
token database learn to be a mediocre dynamic IP blacklist, it would
probably be better to use a proxy to query a real dynamic IP blacklist and
add a header for SpamBayes to mine. However, that's outside the scope of
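
For what it's worth, the proxy side of that idea could be sketched roughly as
follows: reverse the octets of the connecting IP, look the result up under a
DNSBL zone, and record the answer in a header before the message reaches the
classifier. The zone dnsbl.example.org and the X-DNSBL-Dynamic header are
placeholders I invented for the example, not a real blacklist or anything
SpamBayes is known to look for, and the sketch assumes IPv4.

```python
import socket
from email import message_from_string

def dnsbl_listed(ip, zone="dnsbl.example.org", resolve=socket.gethostbyname):
    """Return True if `ip` resolves under the (placeholder) DNSBL zone.

    DNSBLs are queried by reversing the IPv4 octets and appending the zone,
    e.g. 192.0.2.10 becomes 10.2.0.192.dnsbl.example.org.
    """
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        resolve(query)
    except OSError:
        # No record (or a lookup failure) is treated as "not listed" here;
        # a real proxy would want to tell those two cases apart.
        return False
    return True

def tag_message(raw, source_ip, **lookup_args):
    """Return the raw message text with a hypothetical DNSBL header added."""
    msg = message_from_string(raw)
    verdict = "listed" if dnsbl_listed(source_ip, **lookup_args) else "not-listed"
    msg["X-DNSBL-Dynamic"] = verdict
    return msg.as_string()
```

The point of doing it this way is that the token database only has to learn
two tokens ("listed" and "not-listed") instead of thousands of individual
dynamic addresses.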