[Spambayes] Deployment

Greg Ward gward@python.net
Fri, 6 Sep 2002 12:25:05 -0400

On 06 September 2002, Guido van Rossum said:
> Quite independently from testing and tuning the algorithm, I'd like to
> think about deployment.

I was just pondering this this morning.

In case it wasn't obvious, I'm a strong proponent of filtering junk mail
as early as possible, ie. right after the SMTP DATA command has been
completed.  Filtering spam at the MUA just seems stupid to me -- by the
time it gets to my MUA, the spammer has already stolen my bandwidth.  My
public addresses are gward@python.net and gward@mems-exchange.org, so I
want spam stopped by the mail servers for those two domains.  (Hence the
recent MTA switch on starship...)

I guess MUA-level filtering is just a fallback for people who don't have
1) a burning, all-consuming hatred of junk mail, 2) root access to all
mail servers they rely on, and 3) the ability and inclination to install
an MTA with every bell and whistle tweaked to keep out junk mail.

Anyways, here's how I think it should work:

  * as soon as the DATA command is completed, the MTA passes the
    message to some local message-scanning code: a milter with Sendmail,
    local_scan() with Exim.  Dunno if any other MTAs have similar

  * the local scanner feeds the message to spambayes; if it says
    "yep, this is spam", the local scanner generates an SMTP rejection
    message, which the MTA returns to the client, eg.

      550-rejected -- looks like spam
      550 (see http://mail.python.org/spam/17nLfU-0003IT-00)

    The hypothetical web page (one per rejected message) would give an
    explanation of why the message was considered spam (eg. the top 15
    keywords), and give the sender the option to "request review" --
    what I'm thinking is send email to postmaster, and one of the
    postmasters will pop over to another web page, look at the message,
    and either rescue it or decide that it really is spam.

    Yes, I'm willing to risk giving spammers information in order to
    make life easier for false positive victims.  I very much doubt that
    spammers read SMTP rejection messages.
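
The flow in the bullets above could be sketched roughly like this in
Python.  It is only an illustration under stated assumptions: `scan`,
`build_rejection`, the 0.9 threshold, and the `classify` callable that
stands in for spambayes are all hypothetical names, not part of
spambayes, Exim, or Sendmail.

```python
# Hypothetical sketch: none of these names come from spambayes or any MTA.
SPAM_THRESHOLD = 0.9  # assumed cutoff, not a value from this post


def build_rejection(message_id, base_url="http://mail.python.org/spam/"):
    """Return the lines of a multi-line SMTP 550 reply for one message.

    "550-" marks a continuation line; "550 " marks the final line.
    """
    return [
        "550-rejected -- looks like spam",
        "550 (see %s%s)" % (base_url, message_id),
    ]


def scan(message_text, message_id, classify):
    """Return None to accept, or a list of SMTP reply lines to reject.

    `classify` is any callable mapping message text to a spam
    probability in [0, 1]; it stands in for the real classifier.
    """
    if classify(message_text) >= SPAM_THRESHOLD:
        return build_rejection(message_id)
    return None
```

The scanner only decides and formats the reply; handing the reply back
to the SMTP client stays the MTA's job, which is what keeps the same
code usable from a milter or from local_scan().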

As for feeding the message to spambayes: for the Exim servers that I
have a hand in, the local_scan() function is written in Python, so there
shouldn't be any need to spawn a sub-process or open a socket to do
this.  Other sites may not be so lucky, in which case a fast,
low-overhead way to evaluate a message is essential.  Python's startup
overhead is not trivial, but I'd bet Python+spambayes is much faster to
start up than Perl+SpamAssassin.  Python has bytecode compilation, and
the spambayes database is much simpler than SpamAssassin's ruleset.
(Especially if the pickle is changed to a DB, DBM, or CDB file.)  So a
spamd-style daemon is worth considering, but not necessarily the answer.
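
To illustrate the pickle-versus-DBM point: a pickle has to be loaded in
full by every new process, while a DBM file lets a short-lived process
fetch only the tokens that appear in the message at hand.  A minimal
sketch using the `dbm` module from modern Python's standard library
follows.  The storage layout (token as key, probability stored as text)
and the function names are assumptions for illustration, not the
spambayes database format.

```python
import dbm


def save_probabilities(path, probs):
    """Write a {token: spam probability} mapping to a new DBM file."""
    with dbm.open(path, "n") as db:
        for token, prob in probs.items():
            # DBM keys and values are bytes; store the float as text.
            db[token.encode("utf-8")] = repr(prob).encode("ascii")


def lookup(path, tokens, default=0.5):
    """Fetch probabilities for just the given tokens.

    Tokens missing from the database get `default` (an assumed
    neutral value).
    """
    result = {}
    with dbm.open(path, "r") as db:
        for token in tokens:
            try:
                raw = db[token.encode("utf-8")]
            except KeyError:
                result[token] = default
            else:
                result[token] = float(raw.decode("ascii"))
    return result
```

Whether this beats a long-running daemon depends on how much of the
cost is interpreter startup rather than database loading, which this
sketch does not address.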

Anyways, I've outlined a way to gather false positives above.  We
already have a protocol for dealing with false negs -- forward them to
spam@python.org.  Just have to figure out what to do with them then.
(Currently they're piling up in /var/mail/nc-spam [nc = not caught] on

> Eventually, individuals and postmasters should be able to download a
> spambayes software distribution, answer a few configuration questions
> about their mail setup, training and false positives, and install it
> as a filter.

Note that SpamAssassin is not as simple as that to install -- I think
the "few configuration questions about their mail setup" is a massive
black hole that's best avoided.  SA provides tools that tell you whether
something looks like spam, and how spammy it is.  Everything else is up
to the local admin, which makes eminent sense to me.  The mantra is:
SpamAssassin is a tool for *detecting* spam, not for rejecting
it/discarding it/moving it somewhere/whatever.  Do one thing, and do it
well.

The downside of that approach is that every MTA/MDA(/MUA?) community has
to figure out clever ways to integrate SpamAssassin.  This might not be
a bad thing: the ways to integrate SA with Exim just keep getting better
and better, and the SA people don't really have to worry about that.

Greg Ward <gward@python.net>                         http://www.gerg.ca/
A committee is a life form with six or more legs and no brain.