[Spambayes] Deleting spam from the server using only the headers
T.A.Meyer at massey.ac.nz
Tue Sep 23 19:47:21 EDT 2003
> David Mertz has pointed out some research he did a year ago on
> applying statistical methods for detecting spam in email
> headers. The innovation he implemented was to break the
> headers up into trigrams (sequences of three characters) and
> statistically look for suspicious patterns among the trigrams
> in the headers.
> With all the interest in virus-generated spam these days, I thought
> David had an interesting concept. Does this look like something that
> could be adapted to Spambayes?
It would be a piece of cake. Just modify Tokenizer() in tokenizer.py to
generate tri(character)grams instead of split-on-whitespace, comment out
tokenize_body(), and run some tests (see the testtools directory). Post
the results to the list (or spambayes-dev), and there you go!
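
For anyone who wants to try it, here is a rough sketch of what trigram
generation over the headers could look like. It is standalone and does not
use the real Spambayes tokenizer classes: header_trigrams() is a made-up
name, and prefixing each token with the header name is my own assumption,
not something taken from David's research or from tokenizer.py.

```python
from email import message_from_string


def header_trigrams(msg_text):
    """Yield overlapping three-character tokens from each header.

    Hypothetical helper for illustration only; not part of Spambayes.
    """
    msg = message_from_string(msg_text)
    for name, value in msg.items():
        # Collapse runs of whitespace so folded headers behave like one line.
        text = " ".join(str(value).split())
        # Assumption: tag each trigram with its header name, so the same
        # three characters in Subject and Received are separate clues.
        for i in range(len(text) - 2):
            yield "%s:%s" % (name.lower(), text[i:i + 3])


if __name__ == "__main__":
    sample = "Subject: Hello\nFrom: a@b.c\n\nbody text is ignored\n"
    for token in header_trigrams(sample):
        print(token)
```

Only the headers are walked, so the body never contributes tokens, which is
the same effect as commenting out tokenize_body(). Headers shorter than three
characters produce nothing.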
Note that there has been some header/body only testing done, and the
results were reasonable. Googling through the archives should pop the
relevant posts up. In particular, header-only classification looked
like it would be good enough to temporarily leave spam on the server
(while you were away, or when connecting via a mobile, or something like
that). (This is with split-on-whitespace tokens, not trigrams.)