[Spambayes] Re: Deleting spam from the server using only the headers

David Mertz, Ph.D. mertz at gnosis.cx
Tue Sep 23 21:00:18 EDT 2003


William Trenker <wtrenker at shaw.ca> wrote:
|Well, I just noticed that over on python-list at python.org, in a message
|titled _Re:  pop3 email header classifier?_, David Mertz has pointed out
|some research he did a year ago on applying statistical methods for
|detecting spam in email headers.  The innovation he implemented was to
|break the headers up into trigrams (sequences of three characters) and
|statistically look for suspicious patterns among the trigrams in the
|headers.  David's article on IBM developerWorks
|(http://www-106.ibm.com/developerworks/linux/library/l-spamf.html)
|provides more details and a link to his prototype Python code.

I am indeed happy with my approach, and I am even fairly confident that
the trigram model will do better for headers-only than will the word
model.
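
For readers who have not seen the article, here is a minimal sketch of
the trigram idea, not the actual prototype code.  It assumes plain
add-one smoothing and a summed log-likelihood ratio; the function names
(trigrams, train, spam_log_odds) are made up for illustration.

```python
import math
from collections import Counter


def trigrams(text):
    """Return all overlapping three-character sequences in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(header_blocks):
    """Count trigram occurrences over a list of header strings."""
    counts = Counter()
    for block in header_blocks:
        counts.update(trigrams(block))
    return counts


def spam_log_odds(headers, spam_counts, ham_counts):
    """Sum per-trigram log-likelihood ratios (add-one smoothing).

    A positive result means the headers look more like the spam
    sample than the ham sample.  This is an assumed, simplified
    scoring rule, not the weighting used by any real filter.
    """
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    # Vocabulary size for smoothing; +1 reserves mass for unseen trigrams.
    vocab = len(set(spam_counts) | set(ham_counts)) + 1
    score = 0.0
    for tri in trigrams(headers):
        p_spam = (spam_counts[tri] + 1) / (spam_total + vocab)
        p_ham = (ham_counts[tri] + 1) / (ham_total + vocab)
        score += math.log(p_spam / p_ham)
    return score
```

Because trigrams cut across word boundaries, odd punctuation runs and
mangled addresses in headers still contribute evidence, which is the
intuition behind preferring them to whole words here.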

That said, a year ago when I wrote the article mentioned, Spambayes was
in its infancy, and I did not test it.  I have not followed the work
with Spambayes closely--but I have followed it enough to know that Tim
Peters and others have done quite a lot of work exploring variations of
statistical models.  I am quite certain Spambayes is a good deal better
than the naive Bayesian stuff I do (with simplified weighting rules, at
that).  And I even seem to recall reading that Spambayes had some N-gram
options in there, or at least experimented with them.

Still, it would be nice to have a good, friendly system to do basically
what my homebrew 'spamfilter.py' does.  That is, run periodically, check
only headers, and delete obviously spammy messages without ever
downloading them.
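
The headers-only pass can be sketched with the standard poplib module.
This is an illustration under assumptions, not what 'spamfilter.py'
actually does: delete_spam_by_headers and the is_spam callable are
hypothetical names, and the server must support the optional POP3 TOP
command.

```python
import poplib


def delete_spam_by_headers(conn, is_spam):
    """Fetch only headers via TOP and mark spammy messages for deletion.

    conn is a logged-in poplib.POP3 (or POP3_SSL) object; is_spam is a
    hypothetical callable that takes the header text and returns a bool.
    Returns the message numbers that were marked for deletion.
    """
    deleted = []
    count, _size = conn.stat()
    for msg_num in range(1, count + 1):
        # TOP n 0 asks for the headers plus zero lines of the body.
        _resp, lines, _octets = conn.top(msg_num, 0)
        headers = b"\n".join(lines).decode("latin-1")
        if is_spam(headers):
            conn.dele(msg_num)
            deleted.append(msg_num)
    return deleted


# Example use (host and credentials are placeholders):
#   conn = poplib.POP3_SSL("mail.example.org")
#   conn.user("me")
#   conn.pass_("secret")
#   delete_spam_by_headers(conn, my_classifier)
#   conn.quit()   # the server commits deletions only at QUIT
```

Since POP3 only removes messages when the session ends with QUIT, a
crash midway through leaves the mailbox untouched, which is a
reasonable safety margin for a job run periodically from cron.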

Yours, David...

--
mertz@  | The specter of free information is haunting the `Net!  All the
gnosis  | powers of IP- and crypto-tyranny have entered into an unholy
.cx     | alliance...ideas have nothing to lose but their chains.  Unite
        | against "intellectual property" and anti-privacy regimes!
-------------------------------------------------------------------------



