[Spambayes] Conclusions: slow training and x-lookup-ip
David Abrahams
dave at boostpro.com
Sat Feb 13 19:11:11 CET 2010
So here's what I came up with:
* The cache's default is to cache the results of failed lookups for
only 5 minutes (not sure why; many people keep these results for an
hour or more).
* The cache's default is to give up on a lookup after waiting 10
seconds for the server.
* Combine these with a large training set and it's very easy to blow
the 5-minute limit. After thirty DNS timeouts (30 x 10 seconds),
we've already used up our 5 minutes, and all the failed lookup data
from the last pass is considered stale. In fact, the ten-second
timeout is enough to cause plenty of good lookup data to be
considered stale.
* For a nightly training run, I suppose this is OK—it's not like we're
burning CPU cycles while waiting for the DNS request to time out—but
it does cause way more DNS traffic than necessary.
* I don't see any reason that these numbers should be considered good
for individual message classification, either. Most spambots seem
to do their business at around the same time every night, and with a
less-than-24-hour cache for the failed lookups that are so prominent
in spam, I'm guessing the cache for those hostnames has to be
rebuilt each time.
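The arithmetic behind the "thirty timeouts" figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant and function names below are made up for illustration and are not taken from dnscache.py.

```python
# Back-of-the-envelope check of the defaults discussed above.
# All names here are illustrative; they are not dnscache.py attributes.
CACHE_ERROR_SECS = 5 * 60   # default lifetime of a cached failed lookup
DNS_TIMEOUT_SECS = 10       # default wait before a lookup is abandoned


def timeouts_to_exhaust(cache_error_secs, dns_timeout_secs):
    """Back-to-back timeouts needed to use up the error-cache lifetime."""
    # Ceiling division: a partial timeout still pushes us past the limit.
    return -(-cache_error_secs // dns_timeout_secs)


if __name__ == "__main__":
    print(timeouts_to_exhaust(CACHE_ERROR_SECS, DNS_TIMEOUT_SECS))
```

With the defaults this comes out to 30, so any training pass that hits thirty or more unresponsive servers in a row outlives its own failed-lookup cache.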
When I lower the DNS timeout to one second and raise the time that
failed lookups are cached to one hour, repeated training runs go
very fast and do very few DNS lookups. The few lookups that do
happen seem to be producing NXDOMAIN errors, where the DNS response
itself tells us how long the negative answer should be cached, per
RFC 2308.
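To illustrate why the two settings interact this way, here is a rough simulation using a toy cache. It is a sketch under stated assumptions, not the SpamBayes implementation: ToyDnsCache and simulate are hypothetical names, every lookup is assumed to time out, the clock is simulated, and the NXDOMAIN case with an RFC 2308 lifetime is not modelled.

```python
class ToyDnsCache:
    """Hypothetical stand-in for a DNS cache; not the SpamBayes dnscache API."""

    def __init__(self, resolver, clock, cache_error_secs):
        self.resolver = resolver      # callable: host -> answer, or raises LookupError
        self.clock = clock            # callable returning the current time in seconds
        self.cache_error_secs = cache_error_secs
        self.entries = {}             # host -> (expiry_time, answer or None)
        self.queries = 0              # lookups actually sent to the resolver

    def lookup(self, host):
        hit = self.entries.get(host)
        if hit is not None and hit[0] > self.clock():
            return hit[1]             # still fresh, no DNS traffic
        self.queries += 1
        try:
            answer = self.resolver(host)
            ttl = 3600                # assumed lifetime for successful answers
        except LookupError:
            answer = None
            ttl = self.cache_error_secs
        # The entry's lifetime starts once the lookup has finished.
        self.entries[host] = (self.clock() + ttl, answer)
        return answer


def simulate(cache_error_secs, dns_timeout_secs, n_hosts=40, passes=2):
    """Count resolver queries over repeated passes where every lookup times out."""
    now = [0.0]                       # simulated clock, in seconds

    def resolver(host):
        now[0] += dns_timeout_secs    # waiting for the timeout costs wall-clock time
        raise LookupError(host)

    cache = ToyDnsCache(resolver, lambda: now[0], cache_error_secs)
    hosts = ["host%d.invalid" % i for i in range(n_hosts)]
    for _ in range(passes):
        for host in hosts:
            cache.lookup(host)
    return cache.queries


if __name__ == "__main__":
    print(simulate(5 * 60, 10))       # defaults: 80, every host is queried again
    print(simulate(60 * 60, 1))       # tuned: 40, the second pass is all cache hits
```

With the default values, the first pass over 40 dead hosts takes 400 simulated seconds, so every cached failure has expired by the time the second pass reaches it. With a one-second timeout and a one-hour error lifetime, the second pass costs no queries at all.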
I propose to raise the time these failed lookups are cached and lower
the timeout on DNS queries. Here's a patch that uses 48 hours and 1
second, respectively, for those values.
Index: dnscache.py
===================================================================
--- dnscache.py (revision 3256)
+++ dnscache.py (working copy)
@@ -84,10 +84,10 @@
self.returnSinglePTR = True
# How long to cache an error as no data
- self.cacheErrorSecs=5*60
+ self.cacheErrorSecs=2*24*60*60
# How long to wait for the server
- self.dnsTimeout=10
+ self.dnsTimeout=1
# end of user-settable attributes
--
Dave Abrahams Meet me at BoostCon: http://www.boostcon.com
BoostPro Computing
http://www.boostpro.com