[Spambayes] More on 'Spammer Attempts to Circumvent Bayesian Filter'

Seth Goodman sethg at GoodmanAssociates.com
Tue Jul 20 17:21:05 CEST 2004


> From: Richard B Barger ABC APR
> Sent: Monday, July 19, 2004 2:29 PM
>
>
>
> Thank you, Kenny, for taking my flight of fancy seriously.  That's
> another good explanation.
>
> Demonstrating how little I know, I want to take one more stab at this.
>
> I postulate that a long, flowing narrative will have lots of neutral
> words, but, over a large enough user base -- and that is the key --
> will have more of what the mass of users considers ham-tending words
> than spam-tending words (even though, in the case of particular users,
> the opposite doubtless would be the case).
>
> I'm speculating that, over the course of a large quantity of spam and
> a large quantity of ham, fewer tokens show up as spam in most user's
> evenly trained databases than show up as ham.  Put another way, even
> allowing for differences in user preference and experience, I'm
> guessing that the dictionary of spam tokens is smaller than the
> dictionary of ham tokens.
>
> Actually, that somewhat smaller "dictionary" probably works both ways,
> but I still theorize that the larger number of likely ham words over
> the universe of users will result in longer narratives being somewhat
> more likely to be judged ham.
>
> I completely agree about the random gibberish.
>
> Thank you again.

Though I'm far from the expert that Kenny is on this, I'd like to throw
in my two cents on your hypothesis.  It is probably true that properly
selected hammy words could be used to lower the spam score of a message
over a large population of users.  Therein lies the weakness in this
method of attack.  The attacker does not have access to the training
sets of the large population of users (s)he wishes to reach.  Spammers
certainly can run Spambayes and test their messages against their own
training sets, but this would not be very helpful.  They could enlist a
number of their buddies and test the candidate messages against a number
of training sets; however, even these would probably not be
representative of their target population, IMHO.
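
To make the point concrete, here is a minimal sketch of why padding
that helps against one user's training set may do nothing against
another's.  It is a deliberately simplified token scorer (add-one
smoothing and a plain log-odds sum), not Spambayes' actual combining
scheme, and the users, messages, and function names are all made up
for illustration.

```python
import math
from collections import Counter

def train(ham_msgs, spam_msgs):
    """Count, per token, how many ham and spam messages contain it."""
    ham = Counter(tok for m in ham_msgs for tok in set(m.lower().split()))
    spam = Counter(tok for m in spam_msgs for tok in set(m.lower().split()))
    return ham, spam, len(ham_msgs), len(spam_msgs)

def spam_score(message, model):
    """Combine per-token evidence into a score in (0, 1); higher is spammier."""
    ham, spam, n_ham, n_spam = model
    log_odds = 0.0
    for tok in set(message.lower().split()):
        # Add-one smoothing: a token this user never saw contributes nothing.
        p_spam = (spam[tok] + 1) / (n_spam + 2)
        p_ham = (ham[tok] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Two hypothetical users who get the same spam but very different ham.
shared_spam = ["cheap pills buy now", "buy cheap watches now"]
alice = train(["the compiler patch is ready for review",
               "python release meeting notes"], shared_spam)
bob = train(["tomato seedlings need more water",
             "garden club meeting notes"], shared_spam)

plain = "buy cheap pills now"
padded = plain + " compiler patch review"  # padding tuned against Alice's ham

for name, model in (("alice", alice), ("bob", bob)):
    print(name, round(spam_score(plain, model), 3),
          round(spam_score(padded, model), 3))
```

In this toy setup the padding words pull the score down for the user
whose ham happens to contain them and leave the other user's score
where it was, which is the situation a spammer testing against his own
training set cannot see.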

Since Spambayes is a post-acceptance content filtering tool that gives
no feedback to the spammer, they have no direct way to gain the
information they need.  About the only way an attacker can test a given
candidate message designed to lower their spam score with Spambayes is
to do a trial spam run and evaluate their response.  Whatever change
they see would have to be larger than the normal statistical variation
in response rate.  Even with this information, they have no way of
knowing how much of the difference is due to lowering their scores in
Spambayes, as there is a plethora of tools used by ISPs and end users.
Most of those tools allow customization of the rule sets, and they have
no way of knowing how many users at a given provider use what tools, or
what custom rules the provider has put in place for global filtering.  I
think this puts an attacker at an extreme disadvantage, forcing them to
test candidate messages one spam run at a time.
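
As a rough illustration of the "larger than the normal statistical
variation" requirement, the sketch below applies a standard
two-proportion z-test to two spam runs.  The run sizes and response
counts are invented for the example, not measured figures, and the
function name is hypothetical.

```python
import math

def two_proportion_z(resp_a, sent_a, resp_b, sent_b):
    """z statistic for the difference between two response rates."""
    p_a = resp_a / sent_a
    p_b = resp_b / sent_b
    # Pooled rate under the assumption that both runs respond equally well.
    pooled = (resp_a + resp_b) / (sent_a + sent_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    return (p_b - p_a) / std_err

# Invented numbers: one million messages per run,
# 50 responses to the old message and 60 to the reworded one.
z = two_proportion_z(50, 1_000_000, 60, 1_000_000)
significant = abs(z) > 1.96  # conventional two-sided 5% threshold

print(f"z = {z:.2f}, significant: {significant}")
```

With these made-up figures, a 20% relative improvement in responses
still falls short of the usual significance threshold, so a single
trial run would not tell the attacker whether the rewording helped at
all, let alone whether it helped against Spambayes in particular.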

The diversity of post-acceptance content-filtering methods is one of our
strongest advantages.  What works as a strategy with one particular tool
does not necessarily work on another.  Since Spambayes is effectively a
content filter where every user has their own custom rule set, it is
among the hardest to beat.  Without going to the trouble and expense of
recruiting and paying large groups of supposedly typical users to form a
test population, they can't gain much insight into Spambayes rejections.
My guess is that most of their target population would not participate
in such "focus group" studies, so they would have great difficulty
putting together a "focus group" that is truly representative.  In
general, it is very, very hard to bring down any distributed system that
has decentralized administration, and that's what we are dealing with
here.

It's always useful to remember the old "bear run" analogy, which is
applicable to many computer security issues.  Two people are being
chased by a bear.  The first person cries, "We are finished.  We can't
possibly outrun the bear." :(  The second person says, "It's really not
as bad as it seems.  Neither of us can run faster than the bear, but I
can run faster than you." :)  Spambayes is probably the hardest target
out there.  Spammers, in their own interest, are probably forced to
concentrate on easier targets.

--

Seth Goodman


