[Spambayes] Deleting spam from the server using only the headers

Tue Sep 23 19:47:21 EDT 2003

> David Mertz has pointed out some research he did a year ago on 
> applying statistical methods for detecting spam in email 
> headers.  The innovation he implemented was to break the 
> headers up into trigrams (sequences of three characters) and 
> statistically look for suspicious patterns among the trigrams 
> in the headers.
> With all the interest in virus-generated spam these days, I thought
> David had an interesting concept.  Does this look like something that
> could be adapted to Spambayes?

It would be a piece of cake.  Just modify Tokenizer() in tokenizer.py to
generate character trigrams instead of split-on-whitespace tokens,
comment out tokenize_body(), and run some tests (see the testtools
directory).  Post the results to the list (or spambayes-dev), and there
you go!
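
For anyone who wants a starting point, here is a rough standalone sketch
of what header trigram tokenizing might look like.  It is not the actual
Spambayes tokenizer code: the function names char_trigrams() and
tokenize_headers_as_trigrams() are made up for illustration, and the
"headername:trigram" token format is just an assumption about what might
work.  It uses only the standard email package.

```python
from email import message_from_string


def char_trigrams(text):
    """Yield every overlapping three-character sequence in text."""
    for i in range(len(text) - 2):
        yield text[i:i + 3]


def tokenize_headers_as_trigrams(raw_message):
    """Yield 'headername:trigram' tokens for each header.

    Hypothetical helper, not part of Spambayes.  The body is never
    looked at, so this is header-only tokenizing.
    """
    msg = message_from_string(raw_message)
    for name, value in msg.items():
        # Prefix each trigram with the lowercased header name so that
        # the same trigram in different headers counts separately.
        for gram in char_trigrams(str(value)):
            yield "%s:%s" % (name.lower(), gram)
```

Tagging each trigram with its header name is a design choice, not a
requirement; dropping the prefix would pool the evidence from all
headers together.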

Note that some header-only and body-only testing has been done, and the
results were reasonable.  Googling through the archives should turn up
the relevant posts.  In particular, header-only classification looked
like it would be good enough to temporarily leave spam on the server
(while you were away, or when connecting via a mobile, or something like
that).  (This was with split-on-whitespace tokens, not trigrams.)

=Tony Meyer