[spambayes-dev] Re: Generating SB tokens based upon information on the net

Brad Knowles brad.knowles at skynet.be
Wed Aug 4 20:51:37 CEST 2004


At 12:45 PM -0400 2004-08-04, Kenny Pitt wrote:

>                                                                    In
>  cross-validation testing, I found that the results had virtually no effect
>  on the accuracy of the classifier, probably because one or two DNSBL tokens
>  weren't enough to override the effects of all the other tokens from the
>  message itself.

	When I have used DNSBLs in the past, I have used more than just 
one or two.  I typically use a dozen or two.  That should generate 
enough additional information to have a significant impact.
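
	As a rough sketch of what "a dozen or two" lists could look like as 
classifier input, the Python below emits one token per DNSBL zone.  The 
zone names, the "dnsbl:zone:state" token format, and the function names 
are all made up for illustration; none of this is existing SpamBayes 
code.  The resolver is passed in as a parameter so that the logic can be 
exercised without touching the network.

```python
import socket

# Hypothetical zone names, for illustration only.
ZONES = ["dnsbl-one.example", "dnsbl-two.example", "dnsbl-three.example"]


def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address: %r" % (ip,))
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the zone answers for this address.

    Any resolution failure (including a timeout) is treated as "not listed",
    which is a simplification.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True


def dnsbl_tokens(ip, zones=ZONES, resolve=socket.gethostbyname):
    """Yield one token per zone, so that N zones contribute N tokens."""
    for zone in zones:
        state = "listed" if is_listed(ip, zone, resolve) else "clean"
        yield "dnsbl:%s:%s" % (zone, state)
```

	With twenty zones in ZONES, each relay address would add twenty 
tokens to the message instead of one or two.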

>                   It also resulted in a *huge* increase in the time required
>  for SpamBayes to classify a message.

	The DNS lookup time can be significant.  That's true.  That's 
part of why you want to mirror all the DNSBLs that you use so that 
you can query them locally, as opposed to having to go across the 
Internet to get that information.  It takes coordination to set this 
up, but all the major DNSBL providers make these sorts of 
arrangements as a matter of course, and I'm sure that we wouldn't 
have any problems.  I've done this plenty of times before.
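
	One way to check how much a local mirror actually helps is to time 
the lookups themselves.  The helper below is only a measuring sketch 
with hypothetical function names; it assumes the system resolver is the 
thing being measured, and it treats a failed lookup the same as "not 
listed".

```python
import socket
import time


def timed_lookup(name, resolve=socket.gethostbyname):
    """Resolve one name and return (answer_or_None, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        answer = resolve(name)
    except socket.gaierror:
        answer = None  # not listed, or the lookup failed
    return answer, time.perf_counter() - start


def total_lookup_time(names, resolve=socket.gethostbyname):
    """Sum the elapsed time of sequential lookups for a list of query names."""
    return sum(timed_lookup(name, resolve)[1] for name in names)
```

	Running total_lookup_time() over the same query names once against 
the remote servers and once against the local mirror would give a 
direct comparison for a given setup.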

	As for how much additional work is required to process the 
additional information, I do not know.

>                                             Most DNSBL's have an aging
>  feature so that mailhosts will be removed from the blacklist if no spam has
>  been received or reported from them in a certain time period.

	Yup.

>                                                                 If I query a
>  DNSBL for a particular host tomorrow, I might get a different result than I
>  got today.

	Indeed.

>              This is especially problematic for anyone using a
>  train-on-everything strategy.  If SpamBayes identifies a message incorrectly
>  today and automatically trains on it, but I don't get around to reviewing
>  and correcting the training until tomorrow, I could end up trying to remove
>  the wrong set of tokens from the incorrect training corpus and thus
>  corrupting my training database.

	Ahh.  Well, the aging issue is not likely to be a problem unless 
you have waited a very significant amount of time between gathering 
the data and trying to process it -- most servers that are used for 
spam continue to be used for spam for quite some time.

	You might have problems the other way, however -- servers that 
were clean when the message arrived, and whose messages you have saved 
in your "misclassified" folder, may be on a blacklist by the time you 
try to process the information.
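
	Either direction of drift could be sidestepped by remembering the 
exact tokens a message was trained with, rather than repeating the 
DNSBL lookups at correction time.  The toy trainer below sketches that 
idea; the class and its methods are hypothetical and are not the 
SpamBayes training API.

```python
from collections import Counter


class ToyTrainer:
    """Toy spam-count store that untrains from a saved token snapshot."""

    def __init__(self):
        self.spam_counts = Counter()
        self._trained = {}  # msg_id -> frozenset of tokens used when training

    def train_spam(self, msg_id, tokens):
        """Train on the tokens and keep a snapshot of exactly what was used."""
        snapshot = frozenset(tokens)
        self._trained[msg_id] = snapshot
        self.spam_counts.update(snapshot)

    def untrain_spam(self, msg_id):
        """Remove the snapshotted tokens; no new DNSBL lookup is involved."""
        snapshot = self._trained.pop(msg_id)
        for token in snapshot:
            self.spam_counts[token] -= 1
            if self.spam_counts[token] <= 0:
                del self.spam_counts[token]
```

	Because untrain_spam() works from the snapshot, a host that has 
since been added to or dropped from a list cannot change which tokens 
are removed.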

-- 
Brad Knowles, <brad.knowles at skynet.be>

"Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety."

     -- Benjamin Franklin (1706-1790), reply of the Pennsylvania
     Assembly to the Governor, November 11, 1755

   SAGE member since 1995.  See <http://www.sage.org/> for more info.

