[Spambayes] How low can you go?
wsy at merl.com
Sat Dec 13 18:20:57 EST 2003
From: "Tim Peters" <tim.one at comcast.net>
> That's where Gary Robinson's "special scoring gimmick" comes in, a way to
> *count* no more than one feature per source token when scoring. In the
> example, it might decide to score "penis enlargement" as a single feature,
> but, if it did, it would *not* also feed the spamprobs of "penis" and
> "enlargement" into the final score; or it might decide to feed the spamprobs
> of both constituent words into the final score, in which case it would leave
> the spamprob of the bigram out of the score. In effect, scoring "tiles" the
> source with a collection of non-overlapping unigram and bigram features,
> picked in such a way as to approximate maximizing the aggregate spamprob
> strengths over all possible tilings.
>
> That wasn't tested enough to ensure it achieved what it was after, but it
> made a lot of theoretical sense, and worked fine in small preliminary tests.
> The point is to get faster learning without increasing the "spectacular
> failure" rate (which has always been very small, but isn't 0, and would most
> likely get much larger (but still remain "small"!) without a gimmick to
> counteract systematic correlation).
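The tiling described above can be sketched as a small dynamic program. This is not the Spambayes code, just a minimal illustration, assuming a feature's "strength" is its distance from the neutral 0.5 and that `spamprob` is a plain dict of already-trained probabilities:

```python
# Hypothetical sketch of "tiling": cover the token stream with
# non-overlapping unigram and bigram features, maximizing the total
# strength, where strength(p) = |p - 0.5|.

def tile(tokens, spamprob):
    """Return a non-overlapping list of unigram/bigram features covering
    `tokens`, maximizing total strength. Unknown features count as
    neutral (0.5), i.e. strength 0."""
    def strength(feat):
        return abs(spamprob.get(feat, 0.5) - 0.5)

    n = len(tokens)
    # best[i] = (max total strength for tokens[:i], features chosen)
    best = [(0.0, [])] * (n + 1)
    for i in range(1, n + 1):
        # Option 1: end the tiling with the unigram tokens[i-1].
        score, feats = best[i - 1]
        cand = (score + strength(tokens[i - 1]), feats + [tokens[i - 1]])
        if i >= 2:
            # Option 2: end with the bigram tokens[i-2] tokens[i-1].
            bigram = tokens[i - 2] + " " + tokens[i - 1]
            score2, feats2 = best[i - 2]
            if score2 + strength(bigram) > cand[0]:
                cand = (score2 + strength(bigram), feats2 + [bigram])
        best[i] = cand
    return best[n][1]

# Illustrative probabilities only (not trained values):
probs = {"penis": 0.9, "enlargement": 0.9, "penis enlargement": 0.99}
print(tile(["penis", "enlargement"], probs))
```

Note that with these made-up numbers the two unigrams win (combined strength 0.8 beats the bigram's 0.49), matching Tim's point that either tiling can be chosen, but never both at once for the same tokens.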
I tried that too - as the window stepped along, only the most extreme
probability in each window was used. Essentially this decorrelated the
incoming stream so that the Bayesian modeling was a little more accurate.
But the results were a statistical failure: the error rate on my
standard test corpus jumped from 68 errors (using no correction) to 80
using this "tiling" method.
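A minimal sketch of that window scheme, under my reading of it (window width and probability values here are made up for illustration):

```python
# Hypothetical sketch: as a fixed-size window steps across the stream of
# feature probabilities, keep only the most extreme value (farthest from
# the neutral 0.5) in each window, so overlapping correlated features
# contribute at most one value per window position.

def extreme_per_window(probs, width=3):
    """For each window of `width` consecutive probabilities, return the
    one farthest from 0.5 (ties go to the earliest)."""
    return [
        max(probs[i:i + width], key=lambda p: abs(p - 0.5))
        for i in range(len(probs) - width + 1)
    ]

print(extreme_per_window([0.5, 0.9, 0.4, 0.1, 0.6], width=3))
```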
What _has_ worked better is to use a Markov model instead of a
Bayesian model; that actually gets me down to 56 errors.
I haven't tried tiling Markov yet... oh dear... another CPU-day
down the tubes. :)
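For readers unfamiliar with the distinction: naive Bayes treats tokens as independent, while a first-order Markov model conditions each token on its predecessor, so word order matters. A minimal sketch (not my actual code; the training corpora, Laplace smoothing, and vocabulary size are illustrative assumptions):

```python
# Hypothetical sketch of first-order Markov scoring for spam filtering:
# train one transition model on spam, one on ham, and compare which model
# assigns the message a higher log-probability.

import math
from collections import defaultdict

class MarkovScorer:
    def __init__(self):
        # transitions[prev][tok] = how often tok followed prev in training
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, tokens):
        for prev, tok in zip(tokens, tokens[1:]):
            self.transitions[prev][tok] += 1
            self.totals[prev] += 1

    def logprob(self, tokens, vocab_size=1000):
        # Laplace-smoothed log-probability of the token sequence.
        lp = 0.0
        for prev, tok in zip(tokens, tokens[1:]):
            count = self.transitions[prev][tok]
            lp += math.log((count + 1) / (self.totals[prev] + vocab_size))
        return lp

spam = MarkovScorer()
spam.train("buy cheap pills buy cheap pills".split())
ham = MarkovScorer()
ham.train("meeting notes attached see meeting notes".split())

msg = "buy cheap pills".split()
print(spam.logprob(msg) > ham.logprob(msg))  # True: spam model fits better
```

Tiling on top of this would mean choosing, per position, whether the unigram emission or the bigram transition carries the evidence - hence the extra CPU-day.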