[Spambayes] Suggestion

Seth Goodman sethg at goodmanassociates.com
Thu Nov 23 21:14:09 CET 2006


Mr. A.J. O'Neill wrote on Friday, November 17, 2006 7:25 PM -0500:

> One of the many steps involved in the processing was the removal (or
> ignoring of) punctuation before searching for search tokens. I draw
> your attention to the following extract from a Spam Clues report
>
> 'beneficiary'                       0.844828            0      1
> 'beneficiary.'                      0.844828            0      1
>
> I would argue that there is no difference between these two tokens
> and that the inclusion of the punctuation adds nothing to the process
> but in this instance is likely to give the token a lower score than
> may be appropriate.

This type of specific choice in the tokenizer resulted from testing in a
number of people's working environments, where it was shown empirically
to improve classification.  This suggests that the intuition behind your
argument, which I originally shared as well, is not correct for the
purpose of classifying email as ham/spam, at least at the time this was
tested.  A lot of the small choices in Spambayes turn out to be the
result of empirical testing rather than intuition, and it's surprising
how often our intuition about our own language turns out to be wrong.

If you're looking for a reason to explain the empirical results, one
possibility is that keeping the punctuation provides some differentiation
based on grammar, as opposed to just word occurrence.  This is something
you normally don't get with a tokenizer that only recognizes words and
not sentence structure.
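
To illustrate the distinction being debated, here is a toy sketch (not
the actual Spambayes tokenizer, which is considerably more involved) of
the two behaviors:

def split_tokens(text):
    # Whitespace split keeps trailing punctuation, so 'beneficiary' and
    # 'beneficiary.' are counted as two distinct tokens.
    return text.lower().split()

def strip_punct_tokens(text):
    # Stripping punctuation collapses both forms into a single token.
    return [t.strip('.,;:!?') for t in text.lower().split()]

msg = "Dear beneficiary. You are the sole beneficiary"
print(split_tokens(msg))        # [..., 'beneficiary.', ..., 'beneficiary']
print(strip_punct_tokens(msg))  # 'beneficiary' now appears twice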


> I also used a stop list of words which are so common that they are
> useless to index or use in search engines or other search indexes.
> Below are a number of instances of words which I believe are not
> appropriate tokens to use to differentiate between spam and ham
> emails.

There is a clash between the philosophies of naive Bayesian
classification and rule-based schemes.  The idea behind rule-based
schemes is that we can tap human beings' pattern recognition ability to
create rules that we run on a computer.  Since we can recognize spam
easily when we see it, we are the best experts to consult when forming a
rule set.  The problem with this notion is that computers are not
currently capable of drawing inferences the way people do because the
system architecture is so different.  While people can indeed reliably
distinguish spam, often from only a part of the message, they cannot
reliably tell you how they made the decision.

The aim of naive Bayesian classification is to avoid all the particular
problems of trying to construct a useful rule set and instead look at
simple statistical properties of language that do not require human-like
inference.  The underlying model is fundamentally different.  A Bayesian
classifier is not trying to emulate a speaker of natural language.  The
approach has strengths as well as weaknesses.

One of the strengths is that you don't have to decide what words you
think are the best or worst spam indicators.  If you tend to favor
rule-based approaches, this also looks like a huge weakness.  The
classifier learns word probabilities by observing your message
classifications.  To the extent that you are surprised by the spam
probabilities of individual words, you would make the classifier worse
by manually overriding the training results on a token-by-token basis.
This happens far more often than you would think.  Words that are about
as likely to appear in spam as in ham score somewhere near 0.5 and
contribute little to the final score.
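
To make the 0.5 point concrete, here is a rough sketch of how a
Graham-style classifier might derive a token's spam probability from
training counts.  The real Spambayes code uses a somewhat different
(Robinson) formulation, so treat the details as illustrative only:

def token_spamprob(spam_count, ham_count, nspam, nham):
    # Frequencies relative to the size of each training corpus.
    s = spam_count / nspam if nspam else 0.0
    h = ham_count / nham if nham else 0.0
    if s + h == 0.0:
        return 0.5               # unseen tokens are neutral
    return s / (s + h)

# Trained on 100 spam and 100 ham messages:
print(token_spamprob(40, 40, 100, 100))  # 0.5   -- contributes nothing
print(token_spamprob(40,  2, 100, 100))  # ~0.95 -- a strong spam clue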

Another of the strengths is that the word probabilities vary widely
among different recipients.  It's a strength because there is no such
thing as a ham word list that will reliably avoid Bayesian classifiers.
That's also a weakness, if you wish to apply Bayesian methods on a
server without tracking the word probabilities separately for each
mailbox.  What this suggests is that it is equally difficult to come up
with a list of words for the classifier to ignore that would work for
most users.

There is a fundamental disagreement in the approaches of Bayesian and
rule-based systems.  Proponents of rule-based systems believe that
people can best identify what clues are most significant, while
proponents of Bayesian systems either believe that people cannot
reliably identify the most important clues, or even if they can, they
don't care to do so.  The last condition is important if spam avoidance
is simply a utilitarian goal, not a hobby.

Personally, I tried rule-based systems first and then experimented with
Spambayes.  I found that my intuition on word probabilities was indeed
wrong a significant proportion of the time and the naive Bayesian
approach did about as well as my rule-based system when it was at its
peak.  The Bayesian approach requires much less maintenance and works
well for a wide variety of end-users without requiring insight from
them.  I still feel there are very useful rules to help detect spam that
are complementary to word frequency.  These are things such as whether
the message comes from a particular mailing list, whether the sending IP
is on a DNS blacklist that I choose, or to which of my mailbox addresses
the message was sent.  My own compromise on this is to
either put them in the domain MTA, or to write Outlook rules that run
before the Bayesian classifier.

In terms of overall system architecture, I tend to believe that the
rule-based approaches belong in the domain MTA, whenever possible, and
should generate rejections during the SMTP session, preferably before
DATA.  This eliminates most of the spam at the lowest possible system
cost and with the largest savings in bandwidth.  You can eliminate
another significant amount of spam by running rule-based content
filters, such as SpamAssassin, in the MTA.  This is very expensive, so
it is important to run it on as few messages as possible.  This
generates rejections at the end of DATA, which are still useful: if a
legitimate message is improperly classified, the sender at least
receives a bounce.  For the spam that slips through global rule-based
systems, it then makes sense to do
computationally intensive and user-specific content filtering like
Spambayes in the MUA.  The spam load is hopefully reduced enough that
the end-user doesn't mind scanning the junk folder for the occasional
false positive.
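
As a rough sketch of that division of labor (all function names here are
hypothetical stand-ins, not real MTA or SpamAssassin APIs):

def dnsbl_listed(client_ip):
    # Stand-in for a real DNSBL lookup done before DATA.
    return client_ip in {"192.0.2.1"}

def rule_based_score(message):
    # Stand-in for a shared content filter such as SpamAssassin.
    return 6.0 if "viagra" in message.lower() else 0.0

def handle_message(client_ip, message, bayes_spamprob):
    if dnsbl_listed(client_ip):
        return "550 rejected before DATA"      # cheapest: body never transferred
    if rule_based_score(message) > 5.0:
        return "550 rejected at end of DATA"   # sender still gets a bounce
    if bayes_spamprob(message) > 0.9:          # per-user classifier in the MUA
        return "file in Junk folder"
    return "deliver to Inbox"

print(handle_message("203.0.113.9", "Hello, lunch tomorrow?", lambda m: 0.02))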

--
Seth Goodman


