[Spambayes] Deleting spam from the server using only the headers
wtrenker at shaw.ca
Tue Sep 23 12:37:45 EDT 2003
Requests for tools that automatically delete spam on the server come up often. The implied idea is to use the POP3 TOP command to download just the message headers, determine spam-iness by analyzing the headers only, and then delete the spam on the server before it ever reaches the email client.
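For anyone who hasn't played with TOP, here is a rough sketch of that workflow using Python's standard poplib module. The helper names (fetch_headers, delete_spam_on_server) and the looks_like_spam callback are made up for illustration; the callback stands in for whatever headers-only classifier you have, and none of this is part of Spambayes.

```python
import poplib
from email.parser import BytesParser


def fetch_headers(conn, msg_num):
    """Return the parsed headers of one message via TOP with 0 body lines."""
    # POP3.top() returns (response, list_of_byte_lines, octets).
    _resp, lines, _octets = conn.top(msg_num, 0)
    return BytesParser().parsebytes(b"\r\n".join(lines), headersonly=True)


def delete_spam_on_server(conn, looks_like_spam):
    """Mark for deletion every message whose headers the callback flags.

    `looks_like_spam` is a hypothetical callable taking the parsed headers
    and returning True or False. Returns the message numbers marked.
    """
    count, _size = conn.stat()  # (message count, mailbox size)
    deleted = []
    for msg_num in range(1, count + 1):  # POP3 message numbers start at 1
        headers = fetch_headers(conn, msg_num)
        if looks_like_spam(headers):
            conn.dele(msg_num)  # only marks; the server deletes at QUIT
            deleted.append(msg_num)
    return deleted
```

To try it against a real mailbox you would open a connection with something like poplib.POP3_SSL("pop.example.com"), call user() and pass_() with your own credentials (the host name here is a placeholder), run delete_spam_on_server(), and then call quit(). Messages marked with DELE are only removed when the session ends with QUIT, and TOP is an optional POP3 command, so not every server will support it.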
About as often as this seemingly reasonable request surfaces, the response is to point out that it is difficult to use a statistical (Bayesian) technique on just the message headers. For one thing, email headers don't consist of enough words from a wide enough distribution to provide a meaningful sample for calculating a reliable spam probability. So a technique like Spambayes doesn't seem to be possible for this 'headers only' approach to killing spam.
Well, I just noticed that over on python-list at python.org, in a message titled _Re: pop3 email header classifier?_, David Mertz has pointed out some research he did a year ago on applying statistical methods for detecting spam in email headers. The innovation he implemented was to break the headers up into trigrams (sequences of three characters) and statistically look for suspicious patterns among the trigrams in the headers. David's article on IBM developerWorks (http://www-106.ibm.com/developerworks/linux/library/l-spamf.html) provides more details and a link to his prototype Python code.
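To make the trigram idea concrete, here is a minimal sketch of my own. It is not David's code and not how Spambayes tokenizes or scores; the function names and the add-one smoothed log-odds scoring are assumptions chosen only to show the general shape of the technique.

```python
import math
from collections import Counter


def trigrams(text):
    """Return every overlapping three-character sequence in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(header_texts):
    """Count trigram occurrences across a list of raw header strings."""
    counts = Counter()
    for text in header_texts:
        counts.update(trigrams(text))
    return counts


def spam_log_odds(text, spam_counts, ham_counts):
    """Sum smoothed log-likelihood ratios over the trigrams in text.

    A positive result means the trigrams were seen relatively more often
    in the spam training headers; negative means more often in ham.
    This scoring rule is an illustrative assumption.
    """
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    # Add-one smoothing so unseen trigrams never give a zero probability.
    vocab = len(set(spam_counts) | set(ham_counts)) + 1
    score = 0.0
    for tri in trigrams(text):
        p_spam = (spam_counts[tri] + 1) / (spam_total + vocab)
        p_ham = (ham_counts[tri] + 1) / (ham_total + vocab)
        score += math.log(p_spam / p_ham)
    return score
```

The point of working at the character level is that even a short header yields many trigrams, which speaks to the "not enough words" objection above. Whether that gives a reliable enough sample in practice is exactly the open question.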
With all the interest in virus-generated spam these days, I thought David had an interesting concept. Does this look like something that could be adapted to Spambayes?