FW: [spambayes-dev] Results for DNS lookup in tokenizer

Seth Goodman sethg at GoodmanAssociates.com
Sun Apr 11 01:46:51 EDT 2004

> From: Skip Montanaro
> Sent: Saturday, April 10, 2004 10:48 PM


>     Matt> It seems that it's easier for a spammer to find a compromised PC
>     Matt> to relay through than it is for them to find someone willing to
>     Matt> host their site.
> In which case I doubt either of these network IP classification schemes
> will have much effect.

I don't know, Matt may have a point here.  I've been getting a lot of salad
spams that mostly end up in the Unsure folder and tend to score somewhat
neutral.  Many of them do not even use real words to dilute the sales pitch,
they use random combinations of letters separated by white space so there
are relatively few significant tokens.  It's not the smartest strategy, but
I've seen quite a bit of it.  In such cases, could a strong spam clue, such
as the netblock of a spamvertised web site, possibly push it from Unsure
into Spam?  I don't have a feel for Chi-squared combining so this is a
question, not an assertion.
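
One way to get a feel for this is to try the numbers.  What follows is a
minimal sketch of chi-squared combining along the lines of what the
classifier does, not the actual SpamBayes code.  The function names and the
token probabilities are made up for illustration; they are not measured from
any real message.

```python
import math

def chi2q(x2, v):
    """Chance that a chi-squared variable with v (even) degrees of
    freedom is at least x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi2_score(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Spam evidence: many probabilities near 1 make sum(ln(1 - p)) very negative.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Ham evidence: many probabilities near 0 make sum(ln(p)) very negative.
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0

# Hypothetical "salad" message: a handful of weak, balanced clues.
weak_clues = [0.6, 0.4, 0.65, 0.35, 0.6, 0.4]
# The same message plus one strong clue, e.g. a netblock token (assumed 0.99).
with_netblock = weak_clues + [0.99]

print(chi2_score(weak_clues))
print(chi2_score(with_netblock))
```

With so few significant tokens, a single strong clue carries a lot of
weight, so the second score comes out higher than the first.  Whether that
is enough to cross the spam cutoff depends on the real token probabilities
and on where the cutoffs are set.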

I agree with Matt that because of the huge number of compromised Windows
boxes with cable modems on providers (like Comcast) that do not restrict
outgoing port 25 connections to their smarthost, the chance of getting two
spams from the same compromised box is almost nil.  Even if you fragment
the header IP addresses in the same way that Matt suggests (maybe you
already do?), the sheer size of IP address space allocated to dynamic IP
pools at major providers is orders of magnitude larger than the IP space of
hosting services willing to host sites for enlargement products.  It seems
that the hosting service IPs are more likely to generate strong spam clues
than the source IPs of the compromised Windows boxes.  Whether this would
ultimately make enough of a difference, I don't know.
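
To make "fragmenting" concrete, here is a rough sketch of one way an IP
address could be split into netblock tokens.  The function name, the token
format and the choice of /8, /16, /24 and /32 prefixes are my assumptions
for illustration; this is not necessarily what Matt's patch or the
tokenizer actually does.

```python
def ip_prefix_tokens(ip, label="ip"):
    """Return one token per octet-aligned prefix of a dotted-quad address.

    The label and token layout are hypothetical.
    """
    octets = ip.split(".")
    # Reject anything that is not a plain dotted-quad IPv4 address.
    if len(octets) != 4:
        return []
    if not all(o.isdigit() and 0 <= int(o) <= 255 for o in octets):
        return []
    # Emit the /8, /16, /24 and /32 prefixes, coarsest first.
    return ["%s:%s/%d" % (label, ".".join(octets[:n]), 8 * n)
            for n in range(1, 5)]

# 192.0.2.10 is from a range reserved for documentation.
for token in ip_prefix_tokens("192.0.2.10", label="url-ip"):
    print(token)
```

The idea is that two sites hosted in the same netblock share the coarser
tokens even when the full addresses differ, so the classifier can learn
the block rather than the individual host.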


Seth Goodman
