[Spambayes] spammers have found work around?

Fri Nov 5 03:20:20 CET 2004

[Mumm's]
> I've been using spambayes quite a while now with remarkably good
> results, almost 0 spam for many months.
> Recently, though, a few messages have been sneaking through in the 0%
> to 9% range.
> Do you think that spammers are reacting to spambayes and other good
> intelligent filter spam blockers by crafting spam especially to get
> through?

Spammers have been working as hard as they can to evade filters of all
kinds for years, but I doubt anyone is targeting SB specifically. 
We're too small a target, and this kind of classifier is difficult to
sidestep reliably because it's so personalized.  Spam is a business,
and spammers get much higher return on investment by learning to fool
"one filter to rule them all" systems used by large organizations
(corporations, universities, ISPs).

Oversimplified, say you're a well-heeled spammer (as some are).  You
*buy* one of these systems, and then send spam to yourself until you
find a way to fool it.  To a first approximation, then, that spam has
a decent chance of fooling many installations of the same system.  You
then send out millions of spams as fast as you can, before the filter
vendor has a chance to change the system to plug the hole you found.

Now you can do the same thing with your own trained SpamBayes, but it
won't do you much good:  the training you do leaves you with a
different database than the training I do, and unless you spend real
money and effort to investigate me, you'll have no idea how to make
your spam look hammy to *my* classifier.  But if you could afford to
do intelligent targeted marketing, you'd get a higher return by moving
to a more traditional form of targeted advertising, trying to sell me
high-ticket legitimate items instead of assorted bottom-feeding scams.
You'd be out of the spam business then.

That's why SB is hard to beat on a large scale.  It's not trying to
identify spam, it's trying to separate ham from spam according to an
individual's tastes.
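
To make the personalization point concrete, here's a toy sketch in
Python.  It is not SpamBayes's actual classifier; ToyFilter, its
methods, and every training message below are made up for
illustration.  Two filters see the same spam but different ham, and
one crafted message then looks hammy to one of them and spammy to the
other.

```python
import math
from collections import Counter


class ToyFilter:
    """Toy per-user token classifier (naive Bayes, add-one smoothing).

    Hypothetical illustration only, not SpamBayes's algorithm.
    """

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.nmsgs = {"ham": 0, "spam": 0}

    def train(self, text, label):
        # Count each token once per message.
        self.counts[label].update(set(text.lower().split()))
        self.nmsgs[label] += 1

    def spam_prob(self, text):
        # Sum per-token log odds, then squash to a 0..1 score.
        log_odds = 0.0
        for tok in set(text.lower().split()):
            p_spam = (self.counts["spam"][tok] + 1) / (self.nmsgs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.nmsgs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


shared_spam = ["cheap rolex watches", "cheap pills offer"]

# The spammer's own copy, trained on the spammer's ham.
theirs = ToyFilter()
for msg in ["project meeting schedule", "budget meeting notes"]:
    theirs.train(msg, "ham")

# My copy, trained on my (quite different) ham.
mine = ToyFilter()
for msg in ["python tokenizer patch", "classifier test results"]:
    mine.train(msg, "ham")

for msg in shared_spam:
    theirs.train(msg, "spam")
    mine.train(msg, "spam")

# Crafted to look like the spammer's idea of ham.
crafted = "project meeting schedule watches"
their_score = theirs.spam_prob(crafted)
my_score = mine.spam_prob(crafted)
```

With these made-up messages the spammer's copy scores the crafted
message at about 0.14, while mine scores it at about 0.67.  The words
that look hammy to the spammer's database mean nothing to mine.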

> Anyone else noticing this change?

I notice that spam changes all the time.  For example, "Rolex" spam
has become very heavy over the past two weeks in my mix.  A few of
those were Unsure for me at first.  I trained on two, and haven't seen
another one rate Unsure since then.

BTW, I throw away my database a few times each year and start over
from scratch.  I know that this is fun, and that it slashes the database
size.  I suspect it helps recognition accuracy too, but don't know that for
a fact.  If you feel like you're in a rut, try it!  One common cause
for deteriorating accuracy is training a message into the wrong
category (ham as spam, or vice versa), and that's very hard to detect
after the fact.  As the months wear on, that's simply *going* to
happen sooner or later.  Starting over is often the easiest way to
recover from that.
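
As a rough sketch of why starting over helps, assume a database is
nothing more than per-category token counts (a simplification, and
build_database plus the sample messages are hypothetical).  One
mistrained ham leaves its tokens in the spam counts, and rebuilding
from a small, freshly checked set of messages both drops them and
shrinks the database.

```python
from collections import Counter


def build_database(labeled_messages):
    """Build per-category token counts from (text, label) pairs."""
    db = {"ham": Counter(), "spam": Counter()}
    for text, label in labeled_messages:
        db[label].update(set(text.lower().split()))
    return db


def database_size(db):
    """Number of (category, token) entries stored."""
    return sum(len(counter) for counter in db.values())


# Months of accumulated training, including one mistake.
old_training = [
    ("lunch on friday", "ham"),
    ("cheap rolex watches", "spam"),
    ("patch for the tokenizer", "spam"),  # mistake: ham trained as spam
]
old_db = build_database(old_training)

# Starting over from a small, carefully checked set of recent mail.
fresh_training = [
    ("patch for the tokenizer", "ham"),
    ("rolex replicas cheap", "spam"),
]
new_db = build_database(fresh_training)
```

In the old database "tokenizer" counts as a spam clue; in the rebuilt
one it only ever appears as a ham clue, and there are fewer entries
overall.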

> Will spambayes always be able to learn these new techniques?

Answering that requires perfect foreknowledge, so, yes, of course
<wink>.  We haven't made significant changes to the classifier or the
tokenizer in many months, but I haven't noticed any decrease in
effectiveness, even though both the content and form of spam keep
changing.

A good development is the increasing number of email clients that
refuse to download images in HTML email automatically.  I wish that
were universal.  So long as the spammer has to put stuff *in the msg*
itself that's visible to you, they have to make their sales pitch and
their URL visible to classifiers too, and then it can be analyzed. 
When all they give is a URL that automatically downloads a .gif or
.jpg containing an image of the sales message, that's very hard to
analyze.  But if email clients stop downloading that stuff, the
response rate on spam using that trick will fall to 0, and spammers
will stop doing that.  It's easy to forget that their goal isn't to
irritate you, it's to extract money from you.  To do that, they have
to make a visible sales pitch.
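
Here's a small Python sketch of that difference, using only the
standard library's html.parser.  It isn't SpamBayes's tokenizer;
TextAndImages, clues, and the two sample messages (with example.com
URLs) are invented for illustration.  It collects the text between
tags, which is roughly what a classifier can get word clues from,
plus any image URLs.

```python
from html.parser import HTMLParser


class TextAndImages(HTMLParser):
    """Collect words found between tags, plus the src of any <img>."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_data(self, data):
        self.words.extend(data.split())


def clues(html):
    """Return (words, image_urls) extracted from an HTML message body."""
    parser = TextAndImages()
    parser.feed(html)
    parser.close()
    return parser.words, parser.image_urls


# The sales pitch is in the message itself.
text_pitch = (
    "<html><body><p>Cheap Rolex replicas at "
    '<a href="http://example.com/buy">our shop</a></p></body></html>'
)

# The sales pitch lives in a remote image.
image_pitch = (
    '<html><body><img src="http://example.com/pitch.gif"></body></html>'
)

text_words, text_images = clues(text_pitch)
image_words, image_images = clues(image_pitch)
```

The first message yields six words to score; the second yields no
words at all, just one image URL.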