[Spambayes] Deleting spam from the server using only the headers

Tue Sep 23 22:23:26 EDT 2003

[William Trenker]
> The request to have tools to automatically delete Spam on the
> server has come up often.  Implied is the idea to use the POP3 TOP
> command to download just the message headers, determine spam-iness
> by analyzing the headers only, and then delete the spam on the
> server before it ever reaches the email client.
>
> About as often as this seemingly reasonable request surfaces, the
> response is to point out that it is difficult to use a statistical
> (Bayesian) technique on just the message headers.  For one thing,
> email headers don't consist of enough words from a wide enough
> distribution to provide a meaningful sample for calculating a
> reliable spam probability.

According to whom?  BTW, I don't know of any way to calculate a reliable spam
probability, and this project doesn't even try.  We compute "a score", and
don't even claim that it's monotonic with spam probability.  If someone does
claim to compute a reliable spam probability, and you set your spam cutoff
to, say, 0.99, that means up to about 1 of every 100 things it calls spam
will actually be a false positive -- or that its claim to compute reliable
probabilities is wrong.
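
The arithmetic behind that remark can be sketched in a few lines.  This is
only an illustration under the assumption being criticized, namely that each
score really is a calibrated spam probability; the function name and the
batch of scores are made up for the example and are not part of spambayes.

```python
def expected_false_positives(scores, cutoff):
    """Expected number of ham messages among those flagged as spam.

    ASSUMPTION: every score is a calibrated probability, so a message
    scoring s is ham with probability 1 - s.  Spambayes makes no such
    claim about its own scores.
    """
    flagged = [s for s in scores if s >= cutoff]
    return sum(1.0 - s for s in flagged), len(flagged)


# Hypothetical batch: 100 messages sitting exactly at the cutoff.
expected, flagged = expected_false_positives([0.99] * 100, cutoff=0.99)
print(f"{expected:.2f} expected false positives among {flagged} flagged")
```

Messages scoring above the cutoff contribute less than 0.01 each, so the
1-in-100 figure is the worst case for a truly calibrated scorer.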

> So a technique like Spambayes doesn't seem to be possible for this
> 'headers only' approach to killing spam.

We tested this before.  A headers-only classifier worked fine.  So did a
body-only classifier.  Of course looking at both does best.  Heck, at one
time I even tested a classifier that looked at absolutely nothing except the
Subject line.  That did enormously better than luck, and probably better
than 99.99% of user-defined Outlook spam-catching rulesets, but was
significantly poorer than looking at the other header lines too.
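
For anyone wanting to try the headers-only idea against a POP3 server, here
is a minimal sketch using the standard library.  It assumes `conn` behaves
like a `poplib.POP3` object, and `score_headers` is a hypothetical callable
standing in for whatever classifier you plug in; it is not a spambayes API,
and the 0.9 cutoff is an arbitrary placeholder.

```python
import email


def delete_spam_by_headers(conn, score_headers, cutoff=0.9):
    """Mark for deletion every message whose headers score >= cutoff.

    conn: object with the poplib.POP3 interface (stat, top, dele).
    score_headers: HYPOTHETICAL callable taking an email.message.Message
        built from the headers alone and returning a score in [0.0, 1.0].
    Returns the list of message numbers that were marked.
    """
    count, _size = conn.stat()
    marked = []
    for msg_num in range(1, count + 1):
        # TOP n 0 asks the server for the headers plus zero body lines.
        _resp, lines, _octets = conn.top(msg_num, 0)
        headers = email.message_from_bytes(b"\r\n".join(lines))
        if score_headers(headers) >= cutoff:
            # DELE only marks the message; the server removes it when
            # the session ends with QUIT.
            conn.dele(msg_num)
            marked.append(msg_num)
    return marked
```

Against a real mailbox you would build `conn` with `poplib.POP3` or
`poplib.POP3_SSL`, log in with `user()` and `pass_()`, and call `quit()`
afterwards so the deletions take effect.  A deleted false positive is gone
for good, so test with a throwaway account first.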

> Well, I just noticed that over on python-list at python.org, in a
> message titled _Re: pop3 email header classifier?_, David Mertz has
> pointed out some research he did a year ago on applying statistical
> methods for detecting spam in email headers.  The innovation he
> implemented was to break the headers up into trigrams (sequences of
> three characters) and statistically look for suspicious patterns
> among the trigrams in the headers.  David's article on IBM
> developerWorks
> (http://www-106.ibm.com/developerworks/linux/library/l-spamf.html)
> provides more details and a link to his prototype Python code.

We saw significantly worse results with character n-grams (for n in 2 through
5) than with other tokenization strategies.  See comments in tokenizer.py
for discussions of specifics.  Maybe surprisingly, spambayes uses different
tokenization strategies for different kinds of header lines, and that helped
too.
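
To make the comparison concrete, here is a rough sketch of the two kinds of
tokenization being contrasted.  These helpers are illustrative assumptions
written for this message, not the code in tokenizer.py, and the
header-name tagging is just one simple way to treat header lines
differently from one another.

```python
def char_ngrams(text, n=3):
    """Return every overlapping run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_tokens(text):
    """Split on whitespace after lowercasing (a deliberately crude splitter)."""
    return text.lower().split()


def tag_tokens(header_name, tokens):
    """Prefix tokens with the header name, so the same string seen in
    different header lines counts as different evidence."""
    prefix = header_name.lower()
    return [f"{prefix}:{tok}" for tok in tokens]


subject = "Cheap meds now"
print(char_ngrams(subject, 3))
print(tag_tokens("Subject", word_tokens(subject)))
```

Feeding either token stream to the same classifier on the same corpus is the
fair way to compare them.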

> With all the interest in virus-generated spam these days, I thought
> David had an interesting concept.  Does this look like something
> that could be adapted to Spambayes?

If you're a Python programmer and this interests you, it should be quite
easy to try it.