[Spambayes] How low can you go?
wsy at merl.com
Sat Dec 13 18:20:57 EST 2003
From: "Tim Peters" <tim.one at comcast.net>
> That's where Gary Robinson's "special scoring gimmick" comes in, a way to
> *count* no more than one feature per source token when scoring. In the
> example, it might decide to score "penis enlargement" as a single feature,
> but, if it did, it would *not* also feed the spamprobs of "penis" and
> "enlargement" into the final score; or it might decide to feed the spamprobs
> of both constituent words into the final score, in which case it would leave
> the spamprob of the bigram out of the score. In effect, scoring "tiles" the
> source with a collection of non-overlapping unigram and bigram features,
> picked in such a way as to approximate maximizing the aggregate spamprob
> strengths over all possible tilings.
>
> That wasn't tested enough to ensure it achieved what it was after, but it
> made a lot of theoretical sense, and worked fine in small preliminary tests.
> The point is to get faster learning without increasing the "spectacular
> failure" rate (which has always been very small, but isn't 0, and would most
> likely get much larger (but still remain "small"!) without a gimmick to
> counteract systematic correlation).
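The tiling described above can be sketched as a small dynamic program. This is not the Spambayes code, just a minimal illustration, assuming a feature's "strength" is its distance from the neutral 0.5 and that `spamprob` is a plain dict of already-trained probabilities:

```python
# Hypothetical sketch of "tiling": cover the token stream with
# non-overlapping unigram and bigram features, maximizing the total
# strength, where strength(p) = |p - 0.5|.

def tile(tokens, spamprob):
    """Return a non-overlapping list of unigram/bigram features covering
    `tokens`, maximizing total strength. Unknown features count as
    neutral (0.5), i.e. strength 0."""
    def strength(feat):
        return abs(spamprob.get(feat, 0.5) - 0.5)

    n = len(tokens)
    # best[i] = (max total strength for tokens[:i], features chosen)
    best = [(0.0, [])] * (n + 1)
    for i in range(1, n + 1):
        # Option 1: end the tiling with the unigram tokens[i-1].
        score, feats = best[i - 1]
        cand = (score + strength(tokens[i - 1]), feats + [tokens[i - 1]])
        if i >= 2:
            # Option 2: end with the bigram tokens[i-2] tokens[i-1].
            bigram = tokens[i - 2] + " " + tokens[i - 1]
            score2, feats2 = best[i - 2]
            if score2 + strength(bigram) > cand[0]:
                cand = (score2 + strength(bigram), feats2 + [bigram])
        best[i] = cand
    return best[n][1]

# Illustrative probabilities only (not trained values):
probs = {"penis": 0.9, "enlargement": 0.9, "penis enlargement": 0.99}
print(tile(["penis", "enlargement"], probs))
```

Note that with these made-up numbers the two unigrams win (combined strength 0.8 beats the bigram's 0.49), matching Tim's point that either tiling can be chosen, but never both at once for the same tokens.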
I tried that too - as the window stepped along, only the most extreme
probability in each window was used. Essentially this decorrelated the
incoming stream so that the Bayesian modeling was a little more accurate.
But the results were a statistical failure: the error rate on my
standard test corpus jumped from 68 errors (using no correction) to 80
using this "tiling" method.
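A minimal sketch of that window scheme, under my reading of it (window width and probability values here are made up for illustration):

```python
# Hypothetical sketch: as a fixed-size window steps across the stream of
# feature probabilities, keep only the most extreme value (farthest from
# the neutral 0.5) in each window, so overlapping correlated features
# contribute at most one value per window position.

def extreme_per_window(probs, width=3):
    """For each window of `width` consecutive probabilities, return the
    one farthest from 0.5 (ties go to the earliest)."""
    return [
        max(probs[i:i + width], key=lambda p: abs(p - 0.5))
        for i in range(len(probs) - width + 1)
    ]

print(extreme_per_window([0.5, 0.9, 0.4, 0.1, 0.6], width=3))
```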
What _has_ worked better is to use a Markov model instead of a
Bayesian model; that actually gets me down to 56 errors.
I haven't tried tiling Markov yet... oh dear... another CPU-day
down the tubes. :)
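For readers unfamiliar with the distinction: naive Bayes treats tokens as independent, while a first-order Markov model conditions each token on its predecessor, so word order matters. A minimal sketch (not my actual code; the training corpora, Laplace smoothing, and vocabulary size are illustrative assumptions):

```python
# Hypothetical sketch of first-order Markov scoring for spam filtering:
# train one transition model on spam, one on ham, and compare which model
# assigns the message a higher log-probability.

import math
from collections import defaultdict

class MarkovScorer:
    def __init__(self):
        # transitions[prev][tok] = how often tok followed prev in training
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, tokens):
        for prev, tok in zip(tokens, tokens[1:]):
            self.transitions[prev][tok] += 1
            self.totals[prev] += 1

    def logprob(self, tokens, vocab_size=1000):
        # Laplace-smoothed log-probability of the token sequence.
        lp = 0.0
        for prev, tok in zip(tokens, tokens[1:]):
            count = self.transitions[prev][tok]
            lp += math.log((count + 1) / (self.totals[prev] + vocab_size))
        return lp

spam = MarkovScorer()
spam.train("buy cheap pills buy cheap pills".split())
ham = MarkovScorer()
ham.train("meeting notes attached see meeting notes".split())

msg = "buy cheap pills".split()
print(spam.logprob(msg) > ham.logprob(msg))  # True: spam model fits better
```

Tiling on top of this would mean choosing, per position, whether the unigram emission or the bigram transition carries the evidence - hence the extra CPU-day.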