[spambayes-dev] message subject filtering

Kenny Pitt kennypitt at hotmail.com
Tue Aug 31 20:57:43 CEST 2004


John Moriarty wrote:
> A lot of spam shows:
> 
> * Ungrammatical and/or irrelevant wording
> * Random words
> * Gibberish words
> * Deliberately weird or obscure punctuations
> * Since this is true in the header as well as the text body, this
> potentially reduces the loads on the filter.

Are you interested in this because you want to analyze just the headers and
not download the entire message if it is determined to be spam?  If so,
there are other issues besides whether or not we can successfully identify
the spam just based on the headers.  In the case of the Outlook Add-in,
Outlook has already downloaded the message by the time we are told about it.
In the case of the POP3 proxy (sb_server), discarding a message that you
have partially processed is problematic because the e-mail client is already
aware that the message exists and will sometimes get confused if we refuse
to give it any data.

> Random words not seen
> before seem to allow stuff through more easily.

In the case of SpamBayes, this is not true.  SpamBayes assigns a probability
of 0.5 to any word that it hasn't been trained on, and then discards any
words that have a probability between 0.4 and 0.6 before calculating the
spam score.  Because SpamBayes ignores these words, they have absolutely no
effect, either positive or negative, on the classification of the message.
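
To make that concrete, here is a minimal sketch of the filtering step in Python. It is not the actual SpamBayes code: the function name, the constant names and the sample probabilities are invented for illustration. Only the 0.5 default and the 0.4 to 0.6 window come from the description above.

```python
# Illustrative sketch only; names and sample data are hypothetical.
UNKNOWN_WORD_PROB = 0.5   # probability given to a word never seen in training
IGNORE_LOW = 0.4          # words with a probability inside this window
IGNORE_HIGH = 0.6         # are dropped before the spam score is calculated


def significant_clues(words, trained_probs):
    """Return sorted (word, prob) pairs that would still influence the score."""
    clues = []
    for word in set(words):
        # Untrained words fall back to the neutral probability.
        prob = trained_probs.get(word, UNKNOWN_WORD_PROB)
        # Keep only words that lean clearly toward ham or spam.
        if prob < IGNORE_LOW or prob > IGNORE_HIGH:
            clues.append((word, prob))
    return sorted(clues)


# Hypothetical training data: one spammy word, one hammy word, one neutral.
trained = {"viagra": 0.99, "meeting": 0.05, "the": 0.52}
message = ["viagra", "xqzzt", "meeting", "the", "blorf"]
remaining = significant_clues(message, trained)
```

With this sample data, the random words "xqzzt" and "blorf" get 0.5 and are dropped along with the neutral "the", so only "meeting" and "viagra" are left to contribute to the score.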

The only time that random words have an effect on the classification is if
the spammer happens to hit on some words that you *have* seen before.  If
those words have only been seen in spam messages then they only *increase*
the probability that the message will be properly identified as spam.  It is
very rare for the spammer to stumble across a significant number of words
that you have trained as hammy, and even then there aren't usually enough of
them to outweigh the other spammy clues in the message.

> * I also note spam outnumbers ham by up to 100 to one

Maybe for you, but not necessarily for everyone.  While it does seem that
most people these days are receiving more spam than good messages, there are
still some people (those who are extremely active on a lot of high-volume
mailing lists, for example) who get far more ham than spam.  SpamBayes needs
to work equally well regardless of the ratio of ham vs. spam that a
particular user receives.

> And invariably the text body contains the web address of the seller,
> so a web address of itself is a giveaway. 

SpamBayes has an option that will break up URLs and create clues from the
domain name, directory names, etc.  If a particular domain is used a lot in
spam then that will become a spam clue.  The mere presence of a URL in the
message is not a good indicator of spam in general.  I receive a lot of
legitimate mail such as developer newsletters that contain lots of URLs.
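
As a rough illustration of what breaking up a URL into clues might look like, here is a small Python sketch built on the standard library's urlsplit. It is not the SpamBayes tokenizer: the function name and the "proto:" and "url:" clue prefixes are assumptions made for this example.

```python
from urllib.parse import urlsplit


def url_clues(url):
    """Split a URL into scheme, host-name and path pieces (hypothetical helper)."""
    parts = urlsplit(url)
    clues = []
    if parts.scheme:
        clues.append("proto:" + parts.scheme.lower())
    # .hostname is already lowercased and excludes any port or credentials.
    for piece in (parts.hostname or "").split("."):
        if piece:
            clues.append("url:" + piece)
    # Each directory or file name in the path becomes its own clue.
    for piece in parts.path.split("/"):
        if piece:
            clues.append("url:" + piece.lower())
    return clues


clues = url_clues("http://www.Example.com/offers/cheap/index.html")
```

Each piece is then trained like any other word, so a domain or directory name that shows up mostly in spam becomes a spam clue, while the pieces common to all URLs (such as the scheme) stay close to neutral.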

> I am fast at identifying spam by the header alone, using the above
> observations I reckon I spot 90% plus in a blink.

The human brain has a capacity for learning and detecting patterns in
text that far exceeds anything SpamBayes will ever be capable of.  In most cases,
however, SpamBayes can probably process the entire message in less time than
you can process just the header.  The more information SpamBayes has at its
disposal, the less likely it is to make a mistake and toss an important
message into your spam folder.

-- 
Kenny Pitt


