[spambayes-dev] message subject filtering
John Moriarty
heli at helimodels.com
Wed Sep 1 02:42:10 CEST 2004
Hello again
All I am saying is that, very often, the spam falls at the first
fence.
It appears that the new (random/gibberish) words have no effect, which is good.
...SpamBayes needs
to work equally well regardless of the ratio of ham vs. spam that a
particular user receives...
...so what's best training to do when spams vastly predominate?
Question: how many people would open any of these, the last few spams I just
got?
In Spam folder:
account update
RE: Sensually get some of the action
Here is your 50 dollar restaurant card
Online Canadian Generic Phamacy -- Next Day Shipping! Lowest Prices on ...
meeting tommorow at 18-00
Buy pain relief medicine at Unbelievable Prices
Get cash out of your house
Lowest cost for potency meds today nadir cheer coda theses arbiter molehill
line bear joke regulate countryside cruickshank hegemony hanley consoOrder
your meds online with confidence
Attract and Seduce Men
In Possible junk folder:
Make $397
It seems to me that processing would speed up with these dodgy headers.
I wonder whether this message, with all those quoted spam subject lines, gets
deleted as spam ;)
Kind regards,
John Moriarty
(+353) (0)87 2833 530
www.helimodels.com
-----Original Message-----
From: Kenny Pitt [mailto:kennypitt at hotmail.com]
Sent: 31 August 2004 19:58
To: 'John Moriarty'; spambayes-dev at python.org
Cc: 'David Kirwan'
Subject: RE: [spambayes-dev] message subject filtering
John Moriarty wrote:
> A lot of spam shows:
>
> * Ungrammatical and/or irrelevant wording
> * Random words
> * Gibberish words
> * Deliberately weird or obscure punctuations
> * Since this is true in the header as well as the text body, this
> potentially reduces the loads on the filter.
Are you interested in this because you want to analyze just the headers and
not download the entire message if it is determined to be spam? If so,
there are other issues besides whether or not we can successfully identify
the spam just based on the headers. In the case of the Outlook Add-in,
Outlook has already downloaded the message by the time we are told about it.
In the case of the POP3 proxy (sb_server), discarding a message that you
have partially processed is problematic because the e-mail client is already
aware that the message exists and will sometimes get confused if we refuse
to give it any data.
> Random words not seen
> before seem to allow stuff through more easily.
In the case of SpamBayes, this is not true. SpamBayes assigns a probability
of 0.5 to any word that it hasn't been trained on, and then discards any
words that have a probability between 0.4 and 0.6 before calculating the
spam score. Because SpamBayes ignores these words, they have absolutely no
effect, either positive or negative, on the classification of the message.
The only time that random words have an effect on the classification is if
the spammer happens to hit on some words that you *have* seen before. If
those words have only been seen in spam messages then it only *increases*
the probability that the message will be properly identified as spam. It is
very rare for the spammer to stumble across a significant number of words
that you have trained as hammy, and even then there aren't usually enough of
them to outweigh the other spammy clues in the message.
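The discard step described above can be sketched roughly as follows. This is a minimal illustration, not SpamBayes's actual code: the function name `significant_clues`, the sample probabilities, and the inclusive handling of the 0.4 and 0.6 boundaries are all assumptions made for the example.

```python
# Values taken from the explanation above.
UNKNOWN_WORD_PROB = 0.5   # probability given to a word never seen in training
LOW, HIGH = 0.4, 0.6      # words whose probability falls in this band are ignored


def significant_clues(tokens, trained_probs):
    """Return {token: probability} for tokens that would influence the score.

    Hypothetical helper: `trained_probs` maps each trained word to its spam
    probability. Untrained words default to 0.5 and are therefore dropped.
    Treating the band boundaries as inclusive is an assumption.
    """
    clues = {}
    for tok in set(tokens):
        p = trained_probs.get(tok, UNKNOWN_WORD_PROB)
        if LOW <= p <= HIGH:
            continue  # bland or unknown word: no effect either way
        clues[tok] = p
    return clues


# Made-up training data, for illustration only.
trained = {"meds": 0.99, "python": 0.02, "the": 0.5}
subject = "meds nadir cheer coda molehill python the".split()
clues = significant_clues(subject, trained)
print(sorted(clues.items()))
```

With this made-up data only "meds" and "python" survive. The random words ("nadir", "cheer", "coda", "molehill") never reach the scoring step, which is why they cannot push a message toward ham.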
> * I also note spam outnumbers ham by up to 100 to one
Maybe for you, but not necessarily for everyone. While it does seem that
most people these days are receiving more spam than good messages, there are
still some people (those who are extremely active on a lot of high-volume
mailing lists, for example) who get far more ham than spam. SpamBayes needs
to work equally well regardless of the ratio of ham vs. spam that a
particular user receives.
> And invariably the text body contains the web address of the seller,
> so a web address of itself is a giveaway.
SpamBayes has an option that will break up URLs and create clues from the
domain name, directory names, etc. If a particular domain is used a lot in
spam then that will become a spam clue. The mere presence of a URL in the
message is not a good indicator of spam in general. I receive a lot of
legitimate mail such as developer newsletters that contain lots of URLs.
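As a rough sketch of what breaking a URL into clues could look like, the snippet below splits a URL with the standard library's `urllib.parse.urlsplit`. The function name `url_clues` and the `url:...` token prefixes are hypothetical, and the real SpamBayes tokenizer may well split URLs differently.

```python
from urllib.parse import urlsplit


def url_clues(url):
    """Split a URL into candidate clue tokens (illustrative format only)."""
    parts = urlsplit(url)
    clues = []
    if parts.scheme:
        clues.append("url:scheme:" + parts.scheme.lower())
    # urlsplit lower-cases the hostname; emit every suffix so that a
    # frequently abused domain can become a clue on its own.
    labels = parts.hostname.split(".") if parts.hostname else []
    for i in range(len(labels)):
        clues.append("url:host:" + ".".join(labels[i:]))
    # Each non-empty directory or file name becomes its own clue.
    for segment in parts.path.split("/"):
        if segment:
            clues.append("url:path:" + segment.lower())
    return clues


print(url_clues("http://www.example.com/Offers/meds.html"))
```

The point of the design is that each piece is scored separately after training. A domain seen mostly in spam becomes a strong spam clue, while the mere fact that a message contains a URL carries no weight by itself.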
> I am fast at identifying spam by the header alone, using the above
> observations I reckon I spot 90% plus in a blink.
The human brain has a capacity for learning and detecting patterns in the
text that far exceeds what SpamBayes can ever be capable of. In most cases,
however, SpamBayes can probably process the entire message in less time than
you can process just the header. The more information SpamBayes has at its
disposal, the less likely it is to make a mistake and toss an important
message into your spam folder.
--
Kenny Pitt