[spambayes-dev] was [Spambayes] date for new release to handle image spam?

Fri Feb 2 20:45:46 CET 2007

skip at pobox.com wrote on Friday, February 02, 2007 9:23 AM -0600:

> Seth Goodman wrote:
>
> > The word salad they use to drown out significant clues generally
> > fails, but if they throw enough words at it, they sometimes dilute
> > the spam clues sufficiently.  The fact that they throw hundreds of
> > "noise" words at the filters for every spam clue they want to hide
> > and Bayesian filters still catch half or three-quarters of it
> > shows how powerful the Bayesian approach really is....
>
> Hmmm... Could we do something to measure the amount of word salad
> without penalizing large non-image emails?

That's a very interesting idea:  a meta-analysis after tokenizing.  To
restate the hypothesis you imply:  spam using word salad may have a
different percentage of tokens that are significant clues than non-spam
email.  Taking this further, there may also be differences in the total
number of distinct tokens generated, and how many of those tokens are
from words versus synthetic tokens.  So in general, try to make use of any
correlation between spamminess and meta-information like total number of
tokens generated, total number of word tokens generated, number of
significant clues and number of non-significant clues.  A very cool
general extension to Bayesian classification.
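
To make the meta-analysis concrete, here is a minimal sketch of the
per-message statistics listed above.  Everything in it is an assumption
for illustration:  the function name, the 0.1/0.9 cutoffs for what
counts as a significant clue, and the convention that synthetic tokens
contain a colon.  The spamprob argument stands in for whatever lookup
the classifier provides.

```python
def salad_stats(tokens, spamprob, low=0.1, high=0.9):
    """Summarize a tokenized message (illustrative sketch only).

    tokens   -- iterable of token strings from the tokenizer
    spamprob -- hypothetical callable mapping a token to a spam
                probability in [0, 1]
    A token is counted as a significant clue when its probability is
    <= low or >= high.  Synthetic tokens are assumed to contain ':'.
    """
    distinct = set(tokens)
    n = len(distinct)
    significant = [t for t in distinct if not (low < spamprob(t) < high)]
    words = [t for t in distinct if ":" not in t]
    return {
        "distinct": n,
        "word_tokens": len(words),
        "synthetic_tokens": n - len(words),
        "significant": len(significant),
        "non_significant": n - len(significant),
        "significant_fraction": len(significant) / n if n else 0.0,
    }
```

If the hypothesis holds, word-salad spam would tend to show a large
distinct count with a small significant_fraction.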

I don't know how you'd put this meta-information into a form that
Spambayes could make use of.  Let's see, the database tells you how many
times a given token appears in the ham/spam training sets.  From this
you calculate a spam probability that is combined with the results of
other tokens to give an overall spam probability.  For a numeric value
token, you want to calculate a spam probability of the numeric value
with respect to the values in the ham/spam training sets.  It's a
different calculation, but it is still probably amenable to using a
chi-square distribution so you can combine it with other clues.
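
One way this might work, sketched under assumptions:  bucket the
numeric value into a synthetic token so the existing per-token counts
apply to it, then combine its probability with the other clues using a
Fisher-style chi-square combination.  The bucket width, the token
naming and the function names are made up for illustration, and this is
not a copy of the Spambayes code.

```python
import math

def bucket_token(name, value, width=0.1):
    """Map a numeric value to a synthetic token, e.g. 'meta:sigfrac:3'."""
    return "meta:%s:%d" % (name, int(value / width))

def chi2Q(x2, v):
    """P(chi-square with v degrees of freedom >= x2), for even v only."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-clue spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_ham = sum(math.log(p) for p in probs)
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)  # evidence for spam
    H = 1.0 - chi2Q(-2.0 * ln_ham, 2 * n)   # evidence for ham
    return (S - H + 1.0) / 2.0
```

The bucketed token is trained and scored like any other token, so the
meta-information needs no separate database.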

>
> > - zombie hosts tend to be weak on SMTP etiquette, so one clue is
> >   that they often fail to wait when asked; making the SMTP client
> >   wait for 30 seconds before sending the "connect banner" often
> >   tricks impatient zombies into spewing, and you can then hang up;
>
> Yeah, but this is a job for postgrey and other similar tools.

Yes, sendmail/exim/qmail, but we're completely in agreement on the
location.  My point to the OP was that the MTA is the best place to make
spam filtering more effective by cutting down on the amount of spam
post-acceptance filters have to process.  The example was meant to show
the kind of behavioral clues that suggest an SMTP client may not be a
legitimate mail host and should have its connection refused.  I was
suggesting that doing the MTA part a little better has a far greater
return than anything you do later.  I suspect that the best rejection
criterion for image spam is the identity of the SMTP client (a zombie
host), and that's hard to determine once a message is delivered to a
user mailbox.
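
For illustration, the delayed-banner check from my example can be
sketched in a few lines.  This is a toy, not how any particular MTA
implements it:  the function names and the hostname are placeholders,
and a real server would not block a thread per connection like this.

```python
import select
import socket

def is_early_talker(conn, delay=30.0):
    """Return True if the socket becomes readable before the delay expires.

    A well-behaved SMTP client waits for the 220 banner before sending
    anything.  Note that a client hanging up also makes the socket
    readable, which is treated the same way here.
    """
    readable, _, _ = select.select([conn], [], [], delay)
    return bool(readable)

def greet(conn, delay=30.0, hostname="mx.example.org"):
    """Send the banner only to clients that waited; hang up on the rest."""
    if is_early_talker(conn, delay):
        conn.close()
        return False
    conn.sendall(("220 %s ESMTP\r\n" % hostname).encode("ascii"))
    return True
```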

After giving a few examples, I realized that the decision process is
similar to the one used in a post-acceptance spam filter, so perhaps
MTAs could make use of Bayesian classification to make better
decisions.  The current state of the art (OK, bleeding edge) is to use a
reputation system that accumulates reputation (hamminess) for each of
several possible sender identity types, identity qualification methods
and qualification results.  For example, there are three common
identities available at SMTP envelope time:  connecting IP address,
connecting hostname, and SMTP MAILFROM address (domain part only).
Because of the prevalence of forgery, you attempt to qualify each
identity using a hierarchy of possible methods.  Common methods to
qualify an identity are SPF and forward/reverse DNS.  Each qualification
method can produce results of pass, fail or unknown.  The tuple of
(identity, qualification method, qualification result) forms an atom in
the database and holds a reputation score.  There are also behavioral
clues from the connecting SMTP client which are useful when there is no
reputation data.  Finally, there is a time component so the data remains
current.
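
A rough sketch of such a database, assuming exponential decay for the
time component.  The class name, the half-life and the convention that
positive scores mean hammy are all assumptions for illustration.

```python
import time

class ReputationStore:
    """Scores keyed by (identity, method, result), decaying over time."""

    def __init__(self, half_life=30 * 86400.0):
        self.half_life = half_life  # seconds; 30 days is an arbitrary pick
        self.atoms = {}             # key -> (score, time of last update)

    def _decayed(self, key, now):
        score, stamp = self.atoms.get(key, (0.0, now))
        return score * 0.5 ** ((now - stamp) / self.half_life)

    def record(self, identity, method, result, delta, now=None):
        """Add delta (positive = hammy) to an atom's decayed score."""
        now = time.time() if now is None else now
        key = (identity, method, result)
        self.atoms[key] = (self._decayed(key, now) + delta, now)

    def score(self, identity, method, result, now=None):
        """Current decayed score; 0.0 for an atom never seen before."""
        now = time.time() if now is None else now
        return self._decayed((identity, method, result), now)
```

For example, a good message from a MAILFROM domain that passed SPF
might be recorded as
store.record("mailfrom:example.org", "spf", "pass", 1.0).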

Every time a connecting MTA offers a message, the receiving MTA must
make a trinary decision analogous to what Spambayes does:  accept all
messages from this sender (whitelist), deny all messages from this
sender (blacklist), or allow the sender to present messages but filter
each one for content (unsure).  The quality of the decisions is
particularly important for senders with no reputation, as that is where
most spam comes from, yet that group also includes infrequent senders
with real messages.  Sender in this context means the mail host or
domain that bounces go to, not the mailbox address of the author.
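
The three-way decision itself could look something like this, with
made-up thresholds and a simple sum over whatever reputation scores are
available for the sender's qualified identities.  A sender with no
reputation data always lands in the unsure bucket.

```python
def envelope_decision(scores, accept_above=5.0, reject_below=-5.0):
    """Return 'whitelist', 'blacklist' or 'unsure' for a sender.

    scores -- reputation scores (positive = hammy) for the sender's
              qualified identities; empty when nothing is known.
    The thresholds are arbitrary placeholders.
    """
    if not scores:
        return "unsure"
    total = sum(scores)
    if total >= accept_above:
        return "whitelist"
    if total <= reject_below:
        return "blacklist"
    return "unsure"
```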

--
Seth Goodman