[Spambayes] More on 'Spammer Attempts to Circumvent Bayesian Filter'

Seth Goodman sethg at GoodmanAssociates.com
Tue Jul 20 19:51:18 CEST 2004


> From: Richard B Barger ABC APR
> Sent: Tuesday, July 20, 2004 11:14 AM
>

Thanks for the effusive praise.  While I'm far from a guru on this
subject, it is fun to play with the concepts.  I'm glad you are
interested in knowing more about how it actually works.  The more fresh
eyes that look at this, the better off we all are.

<...>

> I still speculate that, over a large enough number of users,
> the longer the "normal-seeming" narrative, the more hammy the
> message  appears to their individual SpamBayes tokenizers.
>
> In an entire standard dictionary, there are:
>
> - far, far more words that no one uses than that most people
> use; these would be discounted by SpamBayes

Strongly agree.  Most people's daily vocabulary is a tiny subset of our
language, and I can only assume the same holds for speakers of other
languages.  However, it is worth noting that different individuals have
different vocabulary subsets.  Some people use words and phrases that
others never use, sometimes forcing the rest of us to consult a
dictionary and expand our own vocabularies!


>
> - far more words that are generally considered ham (across a
> large number of people) than that are considered spam (by SpamBayes
> training)

This is still the part that I am not entirely comfortable with.  Here
I'm straying away from my own area of knowledge and perhaps entering
yours, so I am definitely on shaky ground, but let me take a shot at
it.  This hypothesis sounds reasonable for an individual user.  If so,
its converse is that the spammy words would be _much_ spammier.  I
argue this based on word frequency, even though Spambayes only counts
a given word once per message, even if it appears multiple times.  If
there were more hammy words than spammy words in a given recipient's
incoming message stream, and the recipient used a training regimen
that produced a database representing the most significant hammy and
spammy words for them, then the spammy words would have to be
"stronger" clues than the hammy ones to classify messages correctly.
This is somewhat mitigated by the mathematical chicanery inside
Spambayes, which is slightly biased toward ham, since false positives
are more harmful than false negatives.  Still, I think the general
idea that fewer spammy words would make them "stronger" clues is
probably true.

This suggests that long narratives may contain more hammy words, but the
spammy ones would approximately balance them out in the overall score.
My counter-hypothesis is that "narratives" constructed from words
chosen at random from the dictionary (what has been called "word
salad" on this list - that's a lovely image, isn't it?), when
evaluated against a large group of users' databases, would get an
approximately neutral score.  Some databases would score it as ham,
some would score it as spam, but my _guess_ is that the largest group
would classify it as unsure.
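That guess is easy to caricature in code.  Below is a toy simulation
with an invented vocabulary, invented per-user token probabilities,
and a crude averaging "score" standing in for Spambayes's real
chi-squared combining, so it proves nothing, but it shows the shape of
the claim:

    import random

    random.seed(42)
    VOCAB = ["word%d" % i for i in range(5000)]

    def make_user_db():
        # Pretend each user's database gives each token a spam
        # probability scattered around neutral, since most dictionary
        # words are rare in any one person's mail.
        return {w: min(0.99, max(0.01, random.gauss(0.5, 0.15)))
                for w in VOCAB}

    def score(tokens, db):
        # Crude stand-in for real score combining: a plain average.
        probs = [db[t] for t in tokens]
        return sum(probs) / len(probs)

    users = [make_user_db() for _ in range(200)]
    salad = random.sample(VOCAB, 100)  # a 100-word "word salad"

    ham = spam = unsure = 0
    for db in users:
        s = score(salad, db)
        if s < 0.4:
            ham += 1
        elif s > 0.6:
            spam += 1
        else:
            unsure += 1
    print(ham, unsure, spam)  # nearly every user lands in "unsure"

Averaging a hundred roughly neutral probabilities pulls almost every
simulated user's score toward 0.5, which is the "largest group
classifies it as unsure" outcome I'm guessing at.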

I'm not disputing that there is probably a subset of the hammy words in
the ensemble of databases of the Spambayes user community that has
higher than 50% incidence.  What I'm questioning is how one would go
about guessing what those words might be.  You would need access to the
individual databases.  Alternatively, you would need access to the
individual incoming mail streams, plus any rules that excluded certain
messages from evaluation by Spambayes.  Most importantly, I don't
believe, though I can't prove this, that selecting words at random from
the dictionary would accomplish this.


>
> To me, that means that a couple of relatively long (How long?
> I have no idea!) neutral-seeming narrative passages would be likely
> to raise ham content scores somewhat, because the longer they are,
> the  more likely they are to contain more ham-appearing words.

If my argument above has any merit, which is questionable, using more
random words would only increase the probability of being classified
as unsure: the expected result, meaning the result you would get if
you included every word in the dictionary, is unsure.  The occasional
classification of "word salad" as ham or spam, IMHO, happens because
the sample of words is too small, so the statistical variation
(variance) of the classification is significant compared to the
expected value (the mean).  Another way to put it is that for the user
databases that classified it as ham, the small sample of word salad
represented a lucky guess.
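The variance argument is easy to illustrate with the same sort of toy
numbers as before (again invented, again a plain average rather than
Spambayes's real combining):

    import random
    import statistics

    random.seed(1)

    def salad_score(n):
        # Average n neutral-ish token probabilities (toy values).
        probs = [min(0.99, max(0.01, random.gauss(0.5, 0.15)))
                 for _ in range(n)]
        return sum(probs) / n

    for n in (5, 20, 100, 500):
        scores = [salad_score(n) for _ in range(2000)]
        print(n, round(statistics.stdev(scores), 4))
    # The spread shrinks like 1/sqrt(n): roughly 0.067 at n=5 but only
    # about 0.0067 at n=500, so short salads occasionally get "lucky"
    # scores while long ones pile up at neutral.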

>
> Would this be enough to overcome SpamBayes?  Of course not,
> in most cases.  And if a user is sufficiently interested in
> avoiding spam to use this excellent product, he's certainly
> not going to be tricked by  a spam message that makes it into
> his unsure -- or even his ham -- folder.
>
> So I'm not talking, in particular, about outcomes for the
> spammer.  I'm just
> interested in the theory behind SpamBayes' handling of larger coherent
> narratives, which in my sample of 34,678 messages now represent the
> second-most-frequent type of file that hits my Unsure folder.
>
> I, too, recognize them when I see them.  I'm just trying to
> figure out how to make SpamBayes equally sensitive.  <g>

Train on what you consider ham and spam, keeping the numbers trained
in each category similar, and it will learn.  Remember that some
header information is included as tokens, so Spambayes tends to form a
whitelist of your correspondents if you train on their messages.
Removing messages that arrive from lists with good spam filtering from
the inbox before Spambayes sees them both reduces the load on
Spambayes and shrinks the vocabulary it has to train on, which has
made it more effective in my environment.  Misclassifications in your
training database are very hard to overcome, and it is sometimes
better to start over.
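For what it's worth, the balanced-training discipline I mean is
nothing fancier than this sketch, where train() is a hypothetical
hook, not the real Spambayes API:

    def train_balanced(ham_msgs, spam_msgs, train):
        # Interleave ham and spam and stop at the shorter list, so the
        # database never gets badly lopsided toward either category.
        n = min(len(ham_msgs), len(spam_msgs))
        for ham, spam in zip(ham_msgs[:n], spam_msgs[:n]):
            train(ham, is_spam=False)
            train(spam, is_spam=True)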

--

Seth Goodman


