FW: [spambayes-dev] Results for DNS lookup in tokenizer

Skip Montanaro skip at pobox.com
Sun Apr 11 08:25:15 EDT 2004


    Matt> It seems that it's easier for a spammer to find a compromised PC
    Matt> to relay through than it is for them to find someone willing to
    Matt> host their site.

    Skip> In which case I doubt either of these network IP classification
    Skip> schemes will have much effect.

    Seth> I don't know, Matt may have a point here.  I've been getting a lot
    Seth> of salad spams ....  In such cases, could a strong spam clue, such
    Seth> as the netblock of a spamvertised web site, possibly push it from
    Seth> Unsure into Spam?  

Sure, if there are few tokens, one extra token may have a large enough
effect.  That wasn't the case I was referring to.  Matt was worried about
losing the occasional good message in a sea of spam on a few important
mailing lists.  If those good messages are fairly typical (or if he's
trained on a few of them), there are probably plenty of hammy tokens in each
one, in which case throwing in a netblock isn't going to add much.
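
The difference between the two cases can be sketched numerically.  This is a
simplified illustration, assuming the chi-squared combining scheme used by
the SpamBayes classifier; the token probabilities are made up, and the real
classifier also filters out weak clues and caps the number of tokens it uses,
which this sketch ignores.

```python
import math

def chi2q(x2, dof):
    """Chi-squared survival function; valid for even degrees of freedom."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combined_score(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(spam_stat, 2 * n)
    h = 1.0 - chi2q(ham_stat, 2 * n)
    return (s - h + 1.0) / 2.0

NETBLOCK_CLUE = 0.97                  # hypothetical strong spam clue

few_tokens = [0.35, 0.65, 0.4, 0.6]   # made-up "salad spam": few, weak clues
many_hammy = [0.1] * 30               # made-up typical ham: many hammy clues

shift_few = combined_score(few_tokens + [NETBLOCK_CLUE]) - combined_score(few_tokens)
shift_many = combined_score(many_hammy + [NETBLOCK_CLUE]) - combined_score(many_hammy)

print("shift with few tokens:  %.3f" % shift_few)
print("shift with many tokens: %.3f" % shift_many)
```

With only a handful of weak clues the extra token moves the score a long way
toward spam; with thirty hammy clues it barely registers.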

    Seth> Even if you fragment the header IP addresses in the same way that
    Seth> Matt suggests (maybe you already do?), the sheer size of IP
    Seth> address space allocated to dynamic IP pools at major providers is
    Seth> orders of magnitude larger than the IP space of hosting services
    Seth> willing to host sites for enlargement products.  

Yes, I believe mine_received_headers does fragment in the same way as Matt's
scheme (minus the /(8,16,24,32) suffix which I think is superfluous), which
was why I mentioned it in the first place.
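
As a rough sketch of what that fragmentation looks like (not the actual
SpamBayes tokenizer code; the function name is hypothetical), each leading
prefix of the dotted-quad address becomes its own received: token:

```python
def received_ip_tokens(ip):
    """Yield one 'received:' token per leading prefix of a dotted-quad IP."""
    octets = ip.split(".")
    for n in range(1, len(octets) + 1):
        yield "received:" + ".".join(octets[:n])

tokens = list(received_ip_tokens("12.155.117.29"))
print(tokens)
```

For 12.155.117.29 this produces received:12, received:12.155,
received:12.155.117 and received:12.155.117.29, the same shapes that appear
in the spamcounts output.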

I think with mine_received_headers enabled we're already collecting the same
information (actually more in most instances, since all Received: headers
are parsed).  Here are some examples gotten using spamcounts (post-sorted by
the spam prob) from my current database.

* mail.python.org (slightly hammy):

    % spamcounts -r 'received:12.155'
    db: /Users/skip/.hammiedb
    token,nspam,nham,spam prob
    received:12.155,269,387,0.40438528783
    received:12.155.117,269,387,0.40438528783
    received:12.155.117.29,269,387,0.40438528783

* pobox.com, main relay for most of my mail (again, mostly mildly hammy,
  though with some outliers):

    % spamcounts -r 'received:(208\.58|207\.8)'
    db: /Users/skip/.hammiedb
    token,nspam,nham,spam prob
    received:208.58.216,0,1,0.155172413793
    received:208.58.216.73,0,1,0.155172413793
    received:207.8.226.3,66,92,0.412197950796
    received:207.8.214.3,67,93,0.413216308473
    received:207.8.214,73,98,0.42129893514
    received:208.58.1.193,87,116,0.422927556996
    received:207.8,208,269,0.430284644233
    received:207.8.226,135,171,0.435415990012
    received:208.58,193,239,0.440949675391
    received:208.58.1,193,238,0.441982563175
    received:208.58.1.194,99,118,0.450429768447
    received:207.8.226.2,69,79,0.460422504704
    received:208.58.1.197,5,5,0.494310099573
    received:207.8.214.2,6,5,0.53799693756
    received:208.58.1.198,4,1,0.771713070997

* mail.mojam.com, where my mail eventually winds up (mildly spammy because I
  get lots of non-skip at mojam.com stuff there which is primarily spam):

    % spamcounts -r 'received:199.249'
    db: /Users/skip/.hammiedb
    token,nspam,nham,spam prob
    received:199.249.165.21,0,1,0.155172413793
    received:199.249.165.25,0,1,0.155172413793
    received:199.249,90,55,0.614718002838
    received:199.249.165,90,55,0.614718002838
    received:199.249.165.175,90,54,0.619037063122

Now I cheat and just sort all received: features by spam prob.  The highest
are

    received:69.6,7,0,0.969798657718
    received:biz,7,0,0.969798657718

(perhaps not surprising).  Looking up some of the individual addresses in
the 69.6 block yields a bunch of "host not found" responses.  Also, not all
that surprising.

Looking at the other end of the spectrum, I see

    received:66.163,0,6,0.0348837209302

The IPs I have in that block refer to Yahoo's mail servers.  This suggests
to me they do a pretty good job keeping their relays closed to abuse.
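
The extreme spam prob values for tokens seen on only one side can be
reproduced with a short sketch.  It assumes the Robinson-style adjustment
with unknown-word strength 0.45 and unknown-word probability 0.5 (believed
to be the SpamBayes defaults, but treat them as assumptions here).  The
total message counts are placeholders; they cancel out whenever one of the
two counts is zero.

```python
def spam_prob(nspam, nham, total_spam, total_ham, s=0.45, x=0.5):
    """Adjusted spam probability for a token seen nspam/nham times."""
    spam_ratio = nspam / total_spam
    ham_ratio = nham / total_ham
    p = spam_ratio / (spam_ratio + ham_ratio)   # raw estimate
    n = nspam + nham
    return (s * x + n * p) / (s + n)            # shrink toward x when n is small

# Placeholder totals (hypothetical); irrelevant when nspam or nham is 0.
TOTAL_SPAM = TOTAL_HAM = 1000

print(spam_prob(7, 0, TOTAL_SPAM, TOTAL_HAM))   # received:69.6
print(spam_prob(0, 6, TOTAL_SPAM, TOTAL_HAM))   # received:66.163
print(spam_prob(0, 1, TOTAL_SPAM, TOTAL_HAM))   # a ham-only hapax
```

Under those assumptions the three calls agree with the 0.9698, 0.0349 and
0.1552 values listed above, which is why a single-message token can never
score more extreme than about 0.155 or 0.845.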

    Seth> It seems that the hosting service IPs are more likely to generate
    Seth> strong spam clues than the source IPs of the compromised Windows
    Seth> boxes.  Whether this would ultimately make enough of a difference,
    Seth> I don't know.

Of course, whether or not this helps on any given message depends to a large
degree on how many other features the tokenizer extracts from the message.

Switching gears a bit, I suspect we could probably toss out the
received:N.N.N.N and received:N.N.N features and not lose much in the way of
accuracy since all but a few of them are hapaxes.  

    feature pattern             total           hapaxes
    ---------------             -----           -------
    received:N                   177             77 (44%)
    received:N.N                1606           1228 (76%)
    received:N.N.N              2140           1927 (90%)
    received:N.N.N.N            2548           2362 (93%)

Perhaps the same holds true for hostname-based features (received:biz,
received:creosote.python.org, etc), though it's less clear cut.  Perhaps
none of them are worth keeping:

    feature pattern             total           hapaxes
    ---------------             -----           -------
    received:a                   320            257 (80%)
    received:a.a                1046            867 (83%)
    received:a.a.a              1222           1062 (87%)
    received:a.a.a.a             682            609 (89%)

The above data are from my database which currently contains 102863 tokens.
If I removed all the three- and four-component received: features I'd reduce
the database size by about six percent.
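
The six percent figure can be checked from the two tables, assuming it
counts both the numeric and the hostname-based three- and four-component
features (counting only the numeric ones would give a smaller share):

```python
TOTAL_TOKENS = 102863

# Totals copied from the two tables above.
three_and_four_component = {
    "received:N.N.N": 2140,
    "received:N.N.N.N": 2548,
    "received:a.a.a": 1222,
    "received:a.a.a.a": 682,
}

removed = sum(three_and_four_component.values())
share = removed / TOTAL_TOKENS

print("%d of %d tokens removed (%.1f%%)" % (removed, TOTAL_TOKENS, 100.0 * share))
```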

I'll restate my question.  What does Matt's proposal do that
mine_received_headers doesn't do already?

Skip


