[spambayes-dev] Re: Generating SB tokens based upon information on the net

Wed Aug 4 18:45:07 CEST 2004

Brad Knowles wrote:
> 	If we're not doing DNS blacklist lookups within SpamBayes, then I
> think we need to seriously look at adding that capability in some
> other fashion.  My experience has been that these are some of the
> most important information sources you can have available to you when
> attempting to score a message for spam probability.    

I wrote a patch a while back (never submitted to SourceForge) that would
query a list of DNS blacklists and insert the results as tokens.  In
cross-validation testing, I found that the results had virtually no effect
on the accuracy of the classifier, probably because one or two DNSBL tokens
weren't enough to override the effects of all the other tokens from the
message itself.  It also resulted in a *huge* increase in the time required
for SpamBayes to classify a message.
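
For anyone curious what "insert the results as tokens" amounts to, here is
a minimal sketch rather than the actual patch.  It assumes IPv4 addresses
and the usual DNSBL convention of looking up the reversed address under the
blacklist zone.  The function name, the token format, and the example zones
are all made up for illustration.  The resolver is passed in as a parameter
so the logic can be exercised without touching the network.

```python
import socket


def dnsbl_tokens(ip, zones, resolve=socket.gethostbyname):
    """Yield one hypothetical token per blacklist zone for an IPv4 address.

    A DNSBL lookup for 192.0.2.1 in zone bl.example.org queries the name
    1.2.0.192.bl.example.org; any successful answer means "listed".
    """
    reversed_ip = ".".join(reversed(ip.split(".")))
    for zone in zones:
        query = "%s.%s" % (reversed_ip, zone)
        try:
            resolve(query)
        except OSError:
            # socket.gaierror is a subclass of OSError.  A failed lookup
            # is treated as "not listed" here, which also hides timeouts.
            yield "dnsbl:%s:unlisted" % zone
        else:
            yield "dnsbl:%s:listed" % zone
```

Each zone costs one blocking DNS round trip per relay address, which is
where the extra classification time comes from.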

As I mentioned in another recent post, any dynamic tokens like this can also
cause problems in SpamBayes training.  Most DNSBLs have an aging
feature so that mailhosts are removed from the blacklist if no spam has
been received or reported from them in a certain time period.  If I query a
DNSBL for a particular host tomorrow, I might get a different result than I
got today.  This is especially problematic for anyone using a
train-on-everything strategy.  If SpamBayes identifies a message incorrectly
today and automatically trains on it, but I don't get around to reviewing
and correcting the training until tomorrow, I could end up trying to remove
the wrong set of tokens from the incorrect training corpus and thus
corrupting my training database.
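
One way around this would be to remember the exact token set each message
was trained with and to untrain from that stored copy instead of
re-tokenizing.  The sketch below illustrates the idea with a toy counter;
it is not how SpamBayes stores its training data, and the class and method
names are hypothetical.

```python
from collections import Counter


class SnapshotTrainer:
    """Toy trainer that records the tokens used for each trained message."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.trained = {}  # msg_id -> (is_spam, frozenset of tokens)

    def train(self, msg_id, tokens, is_spam):
        if msg_id in self.trained:
            self.untrain(msg_id)  # avoid counting a message twice
        tokens = frozenset(tokens)
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(tokens)
        self.trained[msg_id] = (is_spam, tokens)

    def untrain(self, msg_id):
        # Remove the tokens seen at training time, not whatever a fresh
        # DNSBL query would return today.
        is_spam, tokens = self.trained.pop(msg_id)
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.subtract(tokens)
        for token in tokens:
            if counts[token] <= 0:
                del counts[token]

    def retrain(self, msg_id, is_spam):
        """Correct a mistaken classification using the stored tokens."""
        _, tokens = self.trained[msg_id]
        self.untrain(msg_id)
        self.train(msg_id, tokens, is_spam)
```

The cost is storing a token set per trained message, and the corrected
training still reflects the blacklist state on the day the message was
first seen.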

-- 
Kenny Pitt