[Spambayes] More on 'Spammer Attempts to Circumvent Bayesian
Richard B Barger ABC APR
Rich at RBarger.com
Tue Jul 20 18:14:27 CEST 2004
What a terrific analysis, Seth -- highly informative, laced with good
humor. Thank you.
I learned a bucketful from your response; I particularly like the
description of SpamBayes as "a post-acceptance content filtering tool" and
the image of spammers conducting focus groups. Nicely put.
If I were a spammer, I wouldn't bother to do the type of testing that
logical business types would use to defeat SpamBayes or other anti-spam
methodologies, although you've provided a pretty good business case. <g>
I'd just operate on the principle that, if I do enough stuff "right" and
rely on the law of large numbers, I'll get my share of suckers. And it
doesn't take many responses to make these spamming nitwits' (actually, I
have another name for them) efforts successful.
I still speculate that, over a large enough number of users, the longer the
"normal-seeming" narrative, the more hammy the message appears to their
individual SpamBayes classifiers.
In an entire standard dictionary, there are:
- far, far more words that almost no one uses than words that most people
use; SpamBayes would give these rare words little or no weight
- far more words that SpamBayes training would treat as ham (across a large
number of people) than words it would treat as spam
To me, that means that a couple of relatively long (How long? I have no
idea!) neutral-seeming narrative passages would be likely to raise ham
content scores somewhat, because the longer they are, the more likely they
are to contain more ham-appearing words.
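To make the hypothesis concrete, here is a rough sketch of chi-squared
combining, which is (as I understand it) the way SpamBayes merges per-token
spam probabilities into one score. The token probabilities below are invented
for illustration, and the two cutoffs (ignore tokens within 0.1 of 0.5, use at
most 150 tokens) are my assumption about the defaults, not values read out of
anyone's configuration.

```python
import math

def chi2q(x2, v):
    """Chance that a chi-squared variable with v (even) degrees of
    freedom comes out at least as large as x2."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combined_score(probs, min_strength=0.1, max_clues=150):
    """Combine per-token spam probabilities into a score from 0 (ham) to 1 (spam)."""
    # Keep only tokens far enough from 0.5, strongest first (assumed cutoffs).
    strong = [p for p in probs if abs(p - 0.5) >= min_strength]
    clues = sorted(strong, key=lambda p: abs(p - 0.5), reverse=True)[:max_clues]
    n = len(clues)
    if n == 0:
        return 0.5
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clues), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clues), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0

# Hypothetical message: 20 strongly spammy tokens from the sales pitch.
pitch = [0.95] * 20
# Hypothetical padding: 40 words this user has never seen (neutral) ...
neutral_padding = [0.5] * 40
# ... versus 40 words this user's training considers mildly hammy.
hammy_padding = [0.2] * 40

score_pitch = combined_score(pitch)
score_neutral = combined_score(pitch + neutral_padding)
score_hammy = combined_score(pitch + hammy_padding)
print(score_pitch, score_neutral, score_hammy)
```

In this toy case the neutral padding changes nothing, because those tokens are
dropped before combining, while the mildly hammy padding pulls the score down
from the spam end toward the middle. Whether real narrative text behaves like
the hammy case depends entirely on each user's own training data.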
Would this be enough to overcome SpamBayes? Of course not, in most cases.
And if a user is sufficiently interested in avoiding spam to use this
excellent product, he's certainly not going to be tricked by a spam message
that makes it into his unsure -- or even his ham -- folder.
So I'm not talking, in particular, about outcomes for the spammer. I'm just
interested in the theory behind SpamBayes' handling of larger coherent
narratives, which in my sample of 34,678 messages now represent the
second-most-frequent type of file that hits my Unsure folder.
I, too, recognize them when I see them. I'm just trying to figure out how
to make SpamBayes equally sensitive. <g>
This has been an education, Seth. Thanks to you and the other gurus who
have given thought and excellent answers to my questions.
Seth Goodman wrote:
[Note: RBB talked about long narratives, ham-tending words, and
> Though I'm far from the expert that Kenny is on this, I'd like to throw in
> my two cents on your hypothesis. It is probably true that properly
> selected hammy words could be used to lower the spam score of a message
> over a large population of users. Therein lies the weakness in this
> method of attack. The attacker does not have access to the training sets
> of the large population of users (s)he wishes to reach. Spammers
> certainly can run Spambayes and test their messages against their own
> training sets, but this would not be very helpful. They could enlist a
> number of their buddies and test the candidate messages against a number
> of training sets; however, even these would probably not be representative
> of their target population, IMHO.
> Since Spambayes is a post-acceptance content filtering tool that gives no
> feedback to the spammer, they have no direct way to gain the information
> they need. About the only way an attacker can test a given candidate
> message designed to lower their spam score with Spambayes is to do a trial
> spam run and evaluate their response. Whatever change they see would have
> to be larger than the normal statistical variation in response rate. Even
> with this information, they have no way of knowing how much of the
> difference is due to lowering their scores in Spambayes, as there are a
> plethora of tools used by ISPs and end-users. Most of those tools allow
> customization of the rule sets, and they have no way of knowing how many
> users at a given provider use what tools, or what custom rules the
> provider has put in place for global filtering. I think this puts an
> attacker at an extreme disadvantage, forcing them to test candidate
> messages one spam run at a time.
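To put a rough number on "larger than the normal statistical variation," here
is a sketch of a standard two-proportion z-test on the response rates of two
spam runs. The function name and all the counts are made up for illustration;
nothing here comes from real spam-run data.

```python
import math

def response_rate_z(resp_a, sent_a, resp_b, sent_b):
    """z statistic for the difference in response rate between run A and run B."""
    rate_a = resp_a / sent_a
    rate_b = resp_b / sent_b
    # Pooled rate under the assumption that both runs really perform the same.
    pooled = (resp_a + resp_b) / (sent_a + sent_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / sent_a + 1.0 / sent_b))
    return (rate_b - rate_a) / std_err

# Hypothetical runs of one million messages each.
z_small = response_rate_z(50, 1_000_000, 65, 1_000_000)   # 50 vs. 65 replies
z_large = response_rate_z(50, 1_000_000, 100, 1_000_000)  # 50 vs. 100 replies
print(z_small, z_large)
```

With these invented counts, going from 50 to 65 replies stays under the usual
1.96 threshold and so cannot be told apart from noise, while doubling to 100
replies clears it. Even a clear difference says nothing about which filter the
change got past.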
> The diversity of post-acceptance content-filtering methods is one of our
> strongest advantages. What works as a strategy with one particular tool
> does not necessarily work on another. Since Spambayes is effectively a
> content filter where every user has their own custom rule set, it is among
> the hardest to beat. Without going to the trouble and expense of
> recruiting and paying large groups of supposedly typical users to form a
> test population, they can't gain much insight into Spambayes rejections.
> My guess is that most of their target population would not participate in
> such "focus group" studies, so they would have great difficulty putting
> together a "focus group" that is truly representative. In general, it is
> very, very hard to bring down any distributed system that has
> decentralized administration, and that's what we are dealing with here.
> It's always useful to remember the old "bear run" analogy, which is
> applicable to many computer security issues. Two people are being chased
> by a bear. The first person cries, "We are finished. We can't possibly
> outrun the bear." :( The second person says, "It's really not as bad as
> it seems. Neither of us can run faster than the bear, but I can run
> faster than you." :) Spambayes is probably the hardest target out there.
> Spammers, in their own interest, are probably forced to concentrate on
> easier targets.
> Seth Goodman