[Spambayes] How low can you go?

Bill Yerazunis wsy at merl.com
Sun Dec 14 10:19:25 EST 2003


   From: "Tim Peters" <tim.one at comcast.net>

   [Bill Yerazunis]
   > I tried that too - for each window stepping, only the most extreme
   > probability was used.  Essentially this decorrelated the incoming
   > stream so that Bayesian modeling was a little more accurate.

   I think details matter a lot here, and I doubt they left your results
   predictive of ours.  Two things in particular:

   1. We do Bayesian modeling of individual word spamprobs, but there's
      nothing Bayesian about the way we combine spamprobs.  As the graphs on

	  http://spambayes.sourceforge.net/background.html

      show, we had relatively enormous "spectacular failure" rates when
      using Bayesian combining in the very early days, but those dropped
      so low after moving to chi-combining that there are no instances
      of a spectacular failure at all on the third graph.  By
      "spectacular failure" I mean an extremely low-scoring spam or
      extremely high-scoring ham.  Graham's scheme produced tons of
      these (compared to what eventually proved possible).

Do you have a good writeup of chi-combining online?

I still haven't quite managed to wrap my head around it from what
I've read.
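My best reconstruction so far, from reading the SpamBayes source: it looks like Fisher's chi-squared method applied twice, once to the spamprobs and once to their complements, with the two tail probabilities folded into one score.  A sketch, if I'm reading it right (names mine):

```python
import math

def chi2Q(x2, v):
    """Survival function (1 - CDF) of chi-squared with even v degrees of
    freedom, computed via the Poisson series -- no SciPy needed."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spamprobs (each strictly in (0,1)) into one score:
    near 1 means spam, near 0 means ham, 0.5 means unsure."""
    n = len(probs)
    # S: chi-squared evidence for spam, from the complements
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # H: chi-squared evidence for ham, from the spamprobs themselves
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

The nice property is that a message with no real evidence either way lands near 0.5 instead of being forced toward an extreme.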

   2. Gary suggested a window-based scoring gimmick, but I didn't
      implement it that way because it was too poor an approximation to
      "strongest over all possible tilings".  Instead it was done like:

       throw all the features into a bag
       while the bag isn't empty:
	   pick a feature F with maximal strength among all
	       features still in the bag (meaning a feature whose
	       spamprob is maximally distant from 0.5, in either
	       direction)
	   feed F's spamprob into scoring
	   remove every feature from the bag that intersects with
	       F in at least one position (in particular, that
	       removes F from the bag, and possibly other features too)
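That greedy bag procedure, sketched in Python (the feature representation -- a spamprob paired with the set of positions it covers -- is illustrative):

```python
def bag_score(features):
    """features: list of (spamprob, positions) pairs, where positions is a
    set of token positions the feature covers.  Returns the spamprobs fed
    into scoring, in the order they were picked."""
    bag = list(features)
    used = []
    while bag:
        # pick the feature whose spamprob is farthest from 0.5,
        # in either direction
        f = max(bag, key=lambda feat: abs(feat[0] - 0.5))
        used.append(f[0])  # feed F's spamprob into scoring
        # remove every feature sharing at least one position with F
        # (this removes F itself, and possibly others)
        bag = [g for g in bag if not (g[1] & f[1])]
    return used
```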

Ah, OK.  That's quite different.

I just took:

 - for each window position
    - put the maximal strength feature in as the local probability
      for the window position.
    - throw the rest of this window's features away.
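i.e., in rough Python (one list of candidate spamprobs per window position):

```python
def window_max_probs(windows):
    """windows: list, one entry per window position, each a list of the
    spamprobs of that window's features.  Keep only the strongest
    (farthest from 0.5) per position; discard the rest."""
    return [max(probs, key=lambda p: abs(p - 0.5)) for probs in windows]
```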

   > What _has_ worked better is to use a Markov model instead of a
   > Bayesian model; that actually gets me down to 56.
   >
   > I haven't tried tiling Markov yet... oh dear... another CPU-day
   > down the tubes.  :)

   Since Markov models come in 150 flavors of their own, I'll wait until you
   write a paper about it <wink>.

It's more like one or two slides.  :-(

Here's the quick description...

Right now, the way CRM114 calculates local probabilities for each
polynomial on each window is biased strongly toward 0.5.  The 
previous formula that worked best was:
 
              hits_this_corpus - (hits_all_corpus - hits_this_corpus)
      0.5 + ---------------------------------------------------------
                        16 * ( hits_all_corpus + 1)
	

which, as you can see, needs a LOT of corpus hits on a feature to get much
certainty out of it.

For a corpus hapax that recurs in the test message, the local probability
would be 0.5 + 1/32 = .53125

With 16 hits on it, all spam, the local probability would be 0.5 +
16/(16*17) = .5588

...um... that's not right.

and now that I'm staring at it... I'm wondering how this could
ever have worked, what I was thinking when I wrote it, and the
shocking realization that I need to spend a LOT more time 
checking my code...
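Coding it up does confirm both numbers above, which makes the real problem
jump out: when every hit is in one corpus the numerator equals
hits_all_corpus, so the value can never exceed 0.5 + 1/16 = 0.5625 no
matter how many hits accumulate:

```python
def local_prob(hits_this, hits_all):
    """The CRM114 local-probability formula quoted above."""
    return 0.5 + (hits_this - (hits_all - hits_this)) / (16.0 * (hits_all + 1))

print(local_prob(1, 1))          # hapax: 0.53125
print(local_prob(16, 16))        # 16 pure-spam hits: ~0.5588
print(local_prob(10**6, 10**6))  # saturates just under 0.5625
```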

	 -Bill Yerazunis




