[spambayes-dev] Re: Generating SB tokens based upon
information on the net
Brad Knowles
brad.knowles at skynet.be
Wed Aug 4 20:51:37 CEST 2004
At 12:45 PM -0400 2004-08-04, Kenny Pitt wrote:
> In
> cross-validation testing, I found that the results had virtually no effect
> on the accuracy of the classifier, probably because one or two DNSBL tokens
> weren't enough to override the effects of all the other tokens from the
> message itself.
When I have used DNSBLs in the past, I have used more than just
one or two. I typically use a dozen or two. That should generate
enough additional information to have a significant impact.
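To make the "dozen or two" idea concrete, here is a minimal sketch of
turning DNSBL answers into extra tokens for one relay IP. It is not
SpamBayes code: the function names, the "dnsbl:<zone>:listed" token
format, and the zone names in the usage note are all assumptions made
for illustration. The resolver is passed in so the lookups can be
pointed at a local mirror or faked in a test.

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name for an IPv4 DNSBL lookup: reversed octets + zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address: %r" % ip)
    return ".".join(reversed(octets)) + "." + zone

def dnsbl_tokens(ip, zones, resolve=socket.gethostbyname):
    """Return one hypothetical token per zone that lists the address.

    Any lookup failure (NXDOMAIN, timeout, ...) is treated as "not listed",
    so an unreachable zone simply contributes no token.
    """
    tokens = []
    for zone in zones:
        try:
            resolve(dnsbl_query_name(ip, zone))
        except OSError:
            continue
        tokens.append("dnsbl:%s:listed" % zone)
    return tokens
```

With a list of, say, twenty zones, a heavily listed relay would add up
to twenty such tokens to the message, rather than the one or two used
in the earlier test. For example,
dnsbl_tokens("192.0.2.1", ["bl-a.example", "bl-b.example"]) would query
1.2.0.192.bl-a.example and 1.2.0.192.bl-b.example (placeholder zones).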
> It also resulted in a *huge* increase in the time required
> for SpamBayes to classify a message.
True, the DNS lookup time can be significant. That's part of why
you want to mirror all the DNSBLs that you use, so that you can
query them locally instead of having to go across the Internet to
get that information. It takes coordination to set this
up, but all the major DNSBL providers make these sorts of
arrangements as a matter of course, and I'm sure that we wouldn't
have any problems. I've done this plenty of times before.
As for how much additional work is required to process the
additional information, I do not know.
> Most DNSBLs have an aging
> feature so that mailhosts will be removed from the blacklist if no spam has
> been received or reported from them in a certain time period.
Yup.
> If I query a
> DNSBL for a particular host tomorrow, I might get a different result than I
> got today.
Indeed.
> This is especially problematic for anyone using a
> train-on-everything strategy. If SpamBayes identifies a message incorrectly
> today and automatically trains on it, but I don't get around to reviewing
> and correcting the training until tomorrow, I could end up trying to remove
> the wrong set of tokens from the incorrect training corpus and thus
> corrupting my training database.
Ahh. Well, the aging issue is not likely to be a problem unless
you have waited a very significant amount of time between gathering
the data and trying to process it -- most servers that are used for
spam continue to be used for spam for quite some time.
You might have problems the other way, however -- servers that
were clean at the time, and which you have saved in your
"misclassified" folder, may be on a blacklist by the time you
try to process the information.
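
Both directions of this drift can be shown with a toy token-count
table. This is only a sketch of the bookkeeping problem, not how
SpamBayes stores its training data: the train/untrain helpers and the
"dnsbl:bl.example:listed" token are hypothetical, and a real store may
well refuse to go negative instead of recording it.

```python
from collections import Counter

def train(counts, tokens):
    """Add one to the count of each distinct token."""
    for tok in set(tokens):
        counts[tok] += 1

def untrain(counts, tokens):
    """Subtract one per distinct token, dropping entries that reach zero."""
    for tok in set(tokens):
        counts[tok] -= 1
        if counts[tok] == 0:
            del counts[tok]

BODY = ["free", "offer"]
DNSBL = "dnsbl:bl.example:listed"  # hypothetical token name

# Case 1: host listed when trained, aged off before the correction.
aged_off = Counter()
train(aged_off, BODY + [DNSBL])
untrain(aged_off, BODY)            # regenerated tokens lack the DNSBL token
# The DNSBL token is left behind with a stale count of 1.

# Case 2: host clean when trained, listed before the correction.
newly_listed = Counter()
train(newly_listed, BODY)
untrain(newly_listed, BODY + [DNSBL])
# The DNSBL token now has a count of -1, although it was never trained.
```

In both cases the tokens from the message body cancel out correctly;
only the DNSBL-derived token ends up with a wrong count, because it is
the one part of the token set that depends on when the lookup was made.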
--
Brad Knowles, <brad.knowles at skynet.be>
"Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.