[Spambayes] Conclusions: slow training and x-lookup-ip

David Abrahams dave at boostpro.com
Sat Feb 13 19:11:11 CET 2010


So here's what I came up with:

* The cache's default is to cache the results of failed lookups for
  only 5 minutes (not sure why; many people keep these results for an
  hour or more).

* The cache's default is to give up on a lookup after waiting 10
  seconds for the server.

* Combine these with a large training set and it's very easy to blow
  the 5 minute limit.  After thirty DNS timeouts (30 x 10 seconds),
  we've already used up our 5 minutes, and all the failed lookup data
  from the last pass is considered stale.  In fact, the ten-second
  timeout is enough to cause plenty of good lookup data to be
  considered stale.

* For a nightly training run, I suppose this is OK (it's not like
  we're burning CPU cycles while waiting for the DNS request to time
  out), but it does cause way more DNS traffic than necessary.

* I don't see any reason that these numbers should be considered good
  for individual message classification, either.  Most spambots seem
  to do their business at around the same time every night, and with a
  less-than-24-hour cache for the failed lookups that are so prominent
  in spam, I'm guessing the cache for those hostnames has to be
  rebuilt each time.
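
As a quick sanity check on the arithmetic in the bullets above, here
is a small sketch.  The function name is made up for illustration; the
only numbers taken from dnscache.py are the two defaults (5 minutes
and 10 seconds) and the two values I try in the patch.

```python
# Back-of-the-envelope check of the numbers quoted above.
# timeouts_until_stale is a hypothetical helper, not part of dnscache.py.

def timeouts_until_stale(cache_error_secs, dns_timeout_secs):
    """How many back-to-back timed-out lookups span one error-cache lifetime."""
    # Ceiling division, so a partial timeout still counts as one lookup.
    return -(-cache_error_secs // dns_timeout_secs)

# Defaults: failed lookups cached for 5 minutes, 10 second DNS timeout.
print(timeouts_until_stale(5 * 60, 10))

# Patched values: failed lookups cached for 48 hours, 1 second DNS timeout.
print(timeouts_until_stale(2 * 24 * 60 * 60, 1))
```

With the defaults this gives 30, which matches the "thirty DNS
timeouts" figure; with the patched values it takes 172800 consecutive
timeouts before the first cached failure goes stale.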

When I lower the DNS timeout to one second and raise the time that
failed lookups are cached to one hour, repeated training runs now go
very fast and do very few DNS lookups.  The few lookups that do
happen seem to be producing NXDOMAIN errors, where the DNS response
itself tells us how long the negative result may be cached (see
RFC 2308).
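
To illustrate why the two settings interact this way, here is a toy
simulation.  It is only a sketch under some assumptions of mine: it
does not use the real dnscache code, it pretends every host in the
training set times out, it uses a simulated clock instead of real DNS
queries, and it assumes a cached failure expires cacheErrorSecs after
the lookup finished.  The name count_lookups is invented for the
example.

```python
def count_lookups(num_dead_hosts, cache_error_secs, dns_timeout_secs, passes=2):
    """Simulate training passes over hosts whose lookups always time out.

    Returns a list with the number of real (uncached) queries per pass.
    """
    now = 0.0        # simulated clock, in seconds
    expires = {}     # host index -> time at which its cached failure goes stale
    per_pass = []
    for _ in range(passes):
        queries = 0
        for host in range(num_dead_hosts):
            if expires.get(host, -1.0) > now:
                continue                  # cached failure still fresh
            queries += 1
            now += dns_timeout_secs       # block until the timeout fires
            expires[host] = now + cache_error_secs
        per_pass.append(queries)
    return per_pass

# Defaults (5 minute error cache, 10 second timeout), 100 dead hosts.
print(count_lookups(100, 5 * 60, 10))

# The values I tested with (1 hour error cache, 1 second timeout).
print(count_lookups(100, 60 * 60, 1))
```

With the defaults, the first pass alone takes 1000 simulated seconds,
so every cached failure is stale by the time the second pass reaches
it and all 100 lookups are repeated.  With a one-hour error cache and
a one-second timeout, the second pass does no lookups at all.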

I propose to raise the time these failed lookups are cached and lower
the timeout on DNS queries.  Here's a patch that uses 48 hours and 1
second, respectively, for those values.

-------------- next part --------------
Index: dnscache.py
===================================================================
--- dnscache.py	(revision 3256)
+++ dnscache.py	(working copy)
@@ -84,10 +84,10 @@
         self.returnSinglePTR = True
 
         # How long to cache an error as no data
-        self.cacheErrorSecs=5*60
+        self.cacheErrorSecs=2*24*60*60
 
         # How long to wait for the server
-        self.dnsTimeout=10
+        self.dnsTimeout=1
 
         # end of user-settable attributes
 
-------------- next part --------------



-- 
Dave Abrahams           Meet me at BoostCon: http://www.boostcon.com
BoostPro Computing
http://www.boostpro.com


