[spambayes-dev] Re: spambayes-dev Digest, Vol 12, Issue 15

Thomas Juntunen juntunen at well.com
Sat Apr 17 11:27:29 EDT 2004

On 04/16/04, Skip Montanaro imposed order on a stream of electrons to say:

>I don't know how 86% relates to how much spam those two features would
>reliably detect, especially in the presence of ham, but my guess is that it's
>much less than the 99+% we need to have an effective spam filtering solution.

Absolutely. I was trying to make the point that we've found spammers change their tactics much more slowly than is commonly assumed. FWIW, the single most common characteristic in my corpus, HTML/multipart with no other parts, would stop around 37% of spam by itself. If anything, this research suggests that a two-stage system should be workable: first something simple, SA or some such, to stop the real grunge, then SB or some such to apply more selective filtering to the smaller inflow that remains.
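A rough sketch of that two-stage idea, not anything SA or SB actually ships. It assumes "HTML/multipart with no other parts" means a message whose only body parts are text/html, and it treats the second stage as an arbitrary scoring callable. The names is_html_only, two_stage, score_fn and cutoff are all hypothetical, and the cheap rule would need checking against ham before being used to reject mail outright.

```python
from email import message_from_string
from email.message import Message
from typing import Callable


def is_html_only(msg: Message) -> bool:
    """Cheap stage-one test: every non-container part is text/html.

    Assumption: this approximates the "HTML/multipart with no other
    parts" characteristic mentioned above.
    """
    leaves = [part for part in msg.walk() if not part.is_multipart()]
    return bool(leaves) and all(
        part.get_content_type() == "text/html" for part in leaves
    )


def two_stage(msg: Message,
              score_fn: Callable[[Message], float],
              cutoff: float = 0.9) -> str:
    """Apply the cheap rule first, then a statistical score.

    score_fn is a placeholder for any classifier returning a spam
    probability in [0, 1]; cutoff is an illustrative value only.
    """
    if is_html_only(msg):
        return "spam"  # stage one: stop the obvious grunge
    # Stage two only sees the smaller inflow that got past stage one.
    return "spam" if score_fn(msg) >= cutoff else "ham"
```

The point of the ordering is that the expensive, selective classifier is only asked about messages the simple rule could not decide.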

>It's clear that spammers try different things.

Yep. I don't have a number handy, but consider that a message can only be munged in so many ways before it is undeliverable. The total useful permutations might be too large for a human to handle easily, but I'm betting not for a computer.

[snip description of spammer tricks]

>I believe they will continue to try other tricks.  One can hope that they are
>running out of tricks to try, but I'm pessimistic.

It's interesting you mention this. I can't say a whole lot right now, but Dr. Sullivan has devised an interesting technique that looks statistically at all the sorts of things you've mentioned. We originally looked at that stuff to try to pin down which spamware any particular spammer might be using, since all those tricks can be considered characteristics of spamware. We came to realize that they are also characteristics of the spammers themselves. Working in conjunction with some folks from Spamhaus, Sullivan is refining a technique to "fingerprint" particular spammers by their choices of URLs/domains, presentation, and so forth. This only works for spammers whose volume is high enough to overcome the "noise" inherent in email. Still, after letting his tool work through a corpus and group spam messages by sender, then manually checking the groupings with WHOIS, dig and so forth, the tool is right a little over 50% of the time with no training whatsoever. I think he is planning to present something about this at some conference (CEAS?) this summer.
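To make the grouping idea concrete, here is a minimal sketch. It is not Dr. Sullivan's technique, whose details aren't described here. It only shows the simplest version of "group messages by a shared choice of URL/domain". The names URL_RE, domains and group_by_domain are made up for illustration, and the input is assumed to be a dict mapping a message id to its body text.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Deliberately loose URL pattern; real spam uses many obfuscations
# that this does not attempt to handle.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def domains(text: str) -> set:
    """Return the set of lowercased hostnames found in URLs in text."""
    found = set()
    for url in URL_RE.findall(text):
        host = urlparse(url.rstrip(".,;)")).hostname
        if host:
            found.add(host)
    return found


def group_by_domain(messages: dict) -> dict:
    """Map each hostname to the ids of messages that link to it.

    Only hostnames shared by at least two messages are kept, since a
    single occurrence says nothing about a common sender.
    """
    groups = defaultdict(list)
    for msg_id, body in messages.items():
        for host in domains(body):
            groups[host].append(msg_id)
    return {host: ids for host, ids in groups.items() if len(ids) > 1}
```

A real fingerprint would presumably combine several such features (presentation, header quirks and so on) rather than rely on domains alone, and the resulting groups would still need the manual WHOIS/dig checking described above.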

Anyway, all I wanted to make clear was that there is statistical evidence that spam techniques change a lot more slowly than people usually assume, not that this is some form of better filtering. In fact, I've been waiting for SpamBayes to get to at least a beta release so I can install it on my Apple laptop.

Thanks for the feedback!
Thomas Juntunen

