[Spambayes] When the words aren't statistically independent/Two excellent articles

Richard B Barger ABC APR Rich at RBarger.com
Sun Aug 8 18:45:02 CEST 2004


Hi, all.

Returning to a discussion several of you were kind enough to engage in
with me a couple of weeks ago:  I commented that "running narrative
text" seemed to be a confounding factor contributing to SpamBayes
misclassifying messages.

I've just stumbled across a couple of articles that the gurus on this
listserv are already familiar with, but that they haven't talked about
here.  One of them, though fascinating, is probably a bit technical for
this list.  However, for anyone who is geeky enough to want more
in-depth info, I commend both to your attention; if you're
non-technical, just skip the math.

The first, "That Gibberish in Your In-Box May Be Good News" --
http://www.ladlass.com/archives/001406.html  -- is highly readable and
even entertaining.

Sample:

"For the spammer, the hope, slim as it seems, is that a few curious
souls will open and read the e-mail, which begins, 'I finally was able
to lsoe the wieght' and goes on to offer a product 'Guanarteed to work
or your menoy back!' Read out loud, the message sounds a little like HAL
the computer in '2001: A Space Odyssey' sinking into aphasia as its
synapses are severed one by one."

The second, more technical, piece --
http://crm114.sourceforge.net/Plateau_Paper.pdf -- addresses the
"running narrative text" issue I asked about, but uses cleaner
technical terminology, speaking of features -- words -- that are not
statistically independent:

<quoting>

One failing of the Bayesian chain rule is that strictly speaking it is
only valid in the case of all features being statistically independent.
This is emphatically not the case in text analysis; the words are not
chosen randomly but with a very significant correlation. What is
remarkable about Bayesian text classifiers is that the Bayesian
classifiers work so well even with this gaping fundamental error.

To avoid the error of presumed decorrelation, it is possible to use a
chi-squared or other combining rules. SpamBayes uses a modified
Chi-squared combining rule.

<end quoting>

"... the words are not chosen randomly, but with a very significant
correlation."

That's what I was trying to say in the earlier discussion, but I was
unable to frame it so elegantly.
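
For the geeky, here is a rough sketch of the difference between the two
combining rules the paper mentions.  It is only an illustration, not
SpamBayes' actual code: the function names and the per-word
probabilities are made up, equal spam/ham priors are assumed, and the
chi-squared version is modeled on Fisher's method of combining
probabilities, which is my assumption about what "a modified
chi-squared combining rule" looks like in spirit.

```python
import math


def naive_bayes_combine(probs):
    """Combine per-word spam probabilities as if the words were
    statistically independent (equal spam/ham priors assumed)."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def chi2q(x2, dof):
    """Probability that a chi-squared variable with an even number of
    degrees of freedom `dof` exceeds `x2`."""
    assert dof > 0 and dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi2_combine(probs):
    """Fisher-style combining: build a spam indicator and a ham
    indicator, then fold them into one score between 0 and 1."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


# Hypothetical message: ten words from one run of narrative text, each
# only mildly "spammy" (0.7).  They are really one clue repeated.
clues = [0.7] * 10
print("naive Bayes:", naive_bayes_combine(clues))
print("chi-squared:", chi2_combine(clues))
```

With these made-up numbers, the naive rule treats ten mild, correlated
clues as near-certain proof of spam, while the chi-squared score stays
noticeably less extreme.  Every probability must lie strictly between 0
and 1, or the logarithms blow up.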

Read 'em and enjoy.

Cheers!

Rich Barger
Kansas City






