[spambayes-dev] Results for DNS lookup in tokenizer
Matthew Dixon Cowles
matt at mondoinfo.com
Sat Apr 10 17:05:51 EDT 2004
I've lately been getting a bunch of spam that's almost entirely
nonsense except for a link or two. Perhaps not surprisingly,
SpamBayes hasn't been catching it all that well.
I could probably improve SpamBayes's performance by turning on more
header checks but on account of some peculiarities of my email, I'm
reluctant to do that. (I read various postmaster, webmaster, and ARIN
contact addresses that get almost nothing but spam but it's important
that I see what little legitimate mail goes to them.)
I don't remember who mentioned it here first, but it seemed to me
that adding a DNS lookup for URLs to the tokenizer would be a good
idea. There's hardly any limit to the number of domains a spammer can
register, but the number of networks that are willing to host a
spammer's website seems to be reasonably small. So I hacked the
tokenizer to generate tokens for the address that a URL in a message
resolves to. It generates four tokens for each address, stripping
values from the dotted-quad from right to left. That is, 10.1.2.3
would generate:
url-ip:10/8
url-ip:10.1/16
url-ip:10.1.2/24
url-ip:10.1.2.3/32
(I realize that that's not how networks are allocated these days, but
byte boundaries seemed as good an arbitrary place to make the cuts as
any other.)
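For clarity, the right-to-left stripping described above can be sketched as a standalone function (an illustration only, not the patch itself; the function name is made up):

```python
def url_ip_tokens(dotted_quad):
    """Generate url-ip tokens for an address, keeping successively
    longer prefixes of the dotted quad at byte boundaries.
    10.1.2.3 -> /8, /16, /24, /32 tokens."""
    octets = dotted_quad.split(".")
    tokens = []
    for i in range(1, len(octets) + 1):
        prefix = ".".join(octets[:i])
        tokens.append("url-ip:%s/%d" % (prefix, 8 * i))
    return tokens

print(url_ip_tokens("10.1.2.3"))
# -> ['url-ip:10/8', 'url-ip:10.1/16', 'url-ip:10.1.2/24', 'url-ip:10.1.2.3/32']
```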
A day's worth of unscientific testing suggested that it works pretty
well; the new tokens quickly started to show up in the classifier's
evidence.
So I set up buckets for a 5-way cross-validation set and ran
timcv.py. The only classification difference between the two runs is
that unsures dropped from 27 to 25. Here's the output from cmp.py for
those who can interpret it better than I can:
nodnss.txt -> dnss.txt
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fp went from 1 to 1 tied
mean fp % went from 0.1 to 0.1 tied
false negative percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fn went from 1 to 1 tied
mean fn % went from 0.1 to 0.1 tied
ham mean ham sdev
0.27 0.22 -18.52% 3.16 2.51 -20.57%
0.36 0.33 -8.33% 3.83 3.61 -5.74%
0.68 0.66 -2.94% 7.28 7.21 -0.96%
0.14 0.10 -28.57% 1.03 0.89 -13.59%
0.31 0.30 -3.23% 2.54 2.54 +0.00%
ham mean and sdev for all runs
0.35 0.32 -8.57% 4.13 3.97 -3.87%
spam mean spam sdev
99.90 99.82 -0.08% 1.02 1.28 +25.49%
99.74 99.83 +0.09% 2.99 1.98 -33.78%
98.91 98.91 +0.00% 5.15 5.11 -0.78%
98.39 98.44 +0.05% 9.37 9.35 -0.21%
98.86 98.79 -0.07% 6.36 6.84 +7.55%
spam mean and sdev for all runs
99.16 99.16 +0.00% 5.77 5.79 +0.35%
ham/spam mean difference: 98.81 98.84 +0.03
I suspect that the results would have been better if I had chosen
more recent spam. I think that I inadvertently chose the oldest spam
from my spam archive.
In case anyone would like to play with it, I'll append my trivial
patch. It requires pydns from:
http://sourceforge.net/projects/pydns/
I think that some lines may need to be un-wrapped by hand. The code
is governed by the option x-pick_apart_urls, so you'll need to have
that turned on for it to work. If you want to do comparison testing,
you'll want that option turned on for both runs. You should note that
while an individual DNS lookup is pretty cheap, doing thousands of
them slows the test down a lot and may hammer your resolving
nameserver pretty hard.
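Since a test corpus tends to contain the same hosts over and over, caching lookup results might soften both problems. A minimal sketch (the cache and the `lookup` callable are hypothetical stand-ins, not part of the patch or of pydns):

```python
# Cache A-record lookups by hostname so each host is resolved at
# most once per run. `lookup` is a placeholder for whatever does
# the real DNS query.
_dns_cache = {}

def cached_lookup(netloc, lookup):
    if netloc not in _dns_cache:
        try:
            _dns_cache[netloc] = lookup(netloc)
        except Exception:  # e.g. a DNS error; remember the failure too
            _dns_cache[netloc] = []
    return _dns_cache[netloc]
```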
I hacked it up in a way that suits me for testing only. Among the
things that ought to be changed if anyone wants it added to the
distributed code:
It should have its own option
The timeout should be configurable
The imports should be moved to a sane place <wink>
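For what it's worth, the first two items might look something like this. This is only a sketch: the option names are invented, and the plain dict stands in for SpamBayes's real Options machinery.

```python
# Hypothetical options; SpamBayes's Options module would supply
# these in the real thing.
options = {
    "x-lookup_url_ip": True,       # its own option, separate from
                                   # x-pick_apart_urls
    "x-lookup_url_ip_timeout": 1.0,  # configurable DNS timeout, seconds
}

def maybe_lookup(netloc, do_lookup):
    """Run the DNS lookup only when its option is on, using the
    configured timeout."""
    if not options["x-lookup_url_ip"]:
        return []
    return do_lookup(netloc, timeout=options["x-lookup_url_ip_timeout"])
```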
Regards,
Matt
*** tokenizer.py.orig 2004-04-10 12:13:20.000000000 -0500
--- tokenizer.py 2004-04-10 15:34:21.000000000 -0500
***************
*** 1052,1057 ****
--- 1052,1078 ----
  url = urllib.unquote(url)
  scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
+
+ import DNS
+ import DNS.Base
+ DNS.DiscoverNameServers()
+ r=DNS.DnsRequest(timeout=1)
+ try:
+     replies=r.req(netloc).answers
+ except DNS.Base.DNSError:
+     pass
+ else:
+     for reply in replies: # Should we limit to one A record?
+         if reply["typename"]=="A":
+             dottedQuad=reply["data"]
+             pushclue("url-ip:%s/32" % dottedQuad)
+             dottedQuadList=dottedQuad.split(".")
+             pushclue("url-ip:%s/8" % dottedQuadList[0])
+             pushclue("url-ip:%s.%s/16" % (dottedQuadList[0],dottedQuadList[1]))
+             pushclue("url-ip:%s.%s.%s/24" % (dottedQuadList[0],
+                 dottedQuadList[1],dottedQuadList[2]))
+
+
  # one common technique in bogus "please (re-)authorize yourself"
  # scams is to make it appear as if you're visiting a valid
  # payment-oriented site like PayPal, CitiBank or eBay, when you