[Spambayes] Beyond Spambayes

Wed Feb 22 18:52:49 CET 2006

On 21 Feb 2006 15:00:56 -0600, Richard B Barger wrote:

> I've been a very pleased Spambayes user for a couple of years.
> Because we have a bunch of public business email addresses, I receive
> a huge volume of email, mostly spam.
>
> I've been delighted with Spambayes, so I wanted to describe what my
> local ISP, Skyway Networks, is doing that is like Spambayes On
> Steroids (I was a beta tester):

This is a typical post-acceptance content analysis system.  It is
effective at keeping a lot of spam away from the user's mailbox, but it
suffers from the same problems as most systems of its type (see below).

<...>

> Here's a brief overview of the process it goes through:
>
> - Before accepting a message, the system checks if the email address
>   is valid. This protects against directory-harvesting attacks by
>   spammers.
> - When the message is accepted, it is next checked for
>   worms/viruses, using three different anti-virus programs.

Here is the basic problem, common to this class of system.  As long as
the recipient envelope address is valid,
the message is accepted for delivery and only _then_ processed to
determine if it is spam.  This is only one step beyond the old
store-and-forward architectures in that it checks for a valid recipient
before accepting.  Since most incoming messages are spam today, the MTA
is forced to silently discard most of what it accepts.  This breaks most
of the assumptions behind SMTP.

Accepting a message for delivery means you accept the responsibility to
do one of two things:  deliver the message to the intended recipient or
send a Delivery Status Notification (DSN or bounce) to the original
sender so they know their mail was not delivered.  Since spam usually
has forged return addresses, you can't send a DSN:  it would go to an
innocent third party or to nobody.  Unless you know the return address
is not a forgery, you shouldn't accept anything that you may not even
attempt to deliver.  Because no system can completely avoid
false positives, the one thing you want to avoid is accepting mail for
delivery and then silently discarding it.  Unfortunately, under the
duress of high spam loads, that is exactly what many older system
designs do.  The cost of the additional bandwidth and CPU usage has to
be borne by the customers, so this approach is far from optimal.

To avoid this, you do as many things as possible during the SMTP
conversation, with an emphasis on rejecting messages at the envelope
stage where you have expended a minimum of resources.  This saves you
bandwidth and avoids the high CPU load of content analysis tools like
virus scanners, SpamAssassin, Pyzor and other techniques that you
describe.  For example, the IP-based DNSBL check should be done
immediately upon request for the SMTP connection.  Why even have a
conversation with an MTA that is blacklisted?  In the unusual event of a
false positive, the sender learns promptly that their message was not
delivered, because their own MTA generates a DSN from your rejection,
rather than assuming you received and ignored their message.
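
As a rough illustration of the connect-time DNSBL check, here is a
minimal Python sketch.  The zone name dnsbl.example.org is a
placeholder, and the function names are hypothetical rather than part
of any particular MTA.  It handles IPv4 only, and it treats any lookup
failure as "not listed", which is an assumption you may not want in
production.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    # DNSBLs are queried with the octets reversed, prepended to the zone.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone returns an A record for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No record (or the lookup failed): assumed here to mean "not listed".
        return False


# Placeholder zone; run this when the connection arrives, before the banner.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

A listed client would then get a 5xx reply in place of the normal
greeting, so no envelope or content processing is ever spent on it.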

Another reason for rejecting as much spam as possible, rather than
accepting and silently discarding it, is that the spammers then _know_
their message went undelivered.  If a message is accepted, they know
there is a minute chance that it will make it into a user's inbox.
That small probability is the basis of their business.  The more MTAs
that reject spam during SMTP, the worse their business model appears.
They don't do this for fun; they need to make money.  To do that, they
have to get their messages accepted at recipient MTAs.  A rejection
says there is 0% chance the message will be seen by anyone.

By employing a variety of rejection tools (e.g. DNSBLs for the
connecting IP plus HELO name and rDNS heuristics), most of the load can
be rejected during the envelope phase of SMTP.  For the messages that
make it past the envelope, it is still possible to do the remaining
content checks during the DATA phase and make the sender wait before
confirming acceptance with a 250 code.  Many people object that
spammers often abuse pipelining, dump the whole message after the DATA
command and then disconnect without waiting for the acceptance.  Any
MTA behaving that way can be added to a local DNSBL so you don't talk
to it next time.  Similarly, there are a number of heuristics that can
catch this type of spammer early.  For example, put in a delay after
the connection request before you send the banner.  Anyone who doesn't
wait for the end of the banner can be safely disconnected and
blacklisted for the future.  If you want to perform a public service,
tarpit them instead of merely rejecting and blacklisting.  That takes
almost none of your resources and a lot of theirs, thereby reducing the
amount of spam they can send out to others.  A small number of
well-placed tarpits can bring a large number of spamming MTAs to their
knees and, if they are trojaned Windows boxes, cause them to crash.
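
The banner-delay heuristic can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function
name greet_or_reject, the five-second default delay and the host name
mail.example.org are all made up for the example, and a real MTA would
do this inside its own event loop.

```python
import select
import socket


def greet_or_reject(conn: socket.socket, delay: float = 5.0) -> bool:
    """Delay the banner; reject any client that speaks before it."""
    # A well-behaved client stays silent until it has seen the 220 banner.
    readable, _, _ = select.select([conn], [], [], delay)
    if readable:
        # The client sent data (or hung up) before the greeting.
        try:
            conn.sendall(b"554 Protocol violation: data sent before greeting\r\n")
        except OSError:
            pass  # the client is already gone
        return False
    conn.sendall(b"220 mail.example.org ESMTP\r\n")
    return True
```

A caller that gets False back would close the connection and could add
the peer address to a local blacklist, as described above.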

Spambayes, like all other MUA solutions, is a tool of last resort.  It
happens to be among the best in its category, but it has to catch
whatever spam your MTA fails to reject.  The less spam it has to deal
with, the less likely you are to ever see any of it.  In addition, the
less spam your MTA accepts and silently discards, the smaller the
chance of silently discarding a wanted communication and the more
spammers know their spew is not being delivered.

It sounds like their implementation is well-done for its type, but it
does not use best current practices.

--
Seth Goodman