[Spambayes] How low can you go?

Bill Yerazunis wsy at merl.com
Sun Dec 14 10:19:25 EST 2003

   From: "Tim Peters" <tim.one at comcast.net>

   [Bill Yerazunis]
   > I tried that too - for each window stepping, only the most extreme
   > probability was used.  Essentially this decorrelated the incoming
   > stream so that Bayesian modeling was a little more accurate.

   I think details matter a lot here, and I doubt they left your results
   predictive of ours.  Two things in particular:

   1. We do Bayesian modeling of individual word spamprobs, but there's
      nothing Bayesian about the way we combine spamprobs.  As the graphs on


      show, we had relatively enormous "spectacular failure" rates when
      using Bayesian combining in the very early days, but those dropped
      so low after moving to chi-combining that there are no instances
      of a spectacular failure at all on the third graph.  By
      "spectacular failure" I mean an extremely low-scoring spam or
      extremely high-scoring ham.  Graham's scheme produced tons of
      these (compared to what eventually proved possible).

Do you have a good writeup of chi-combining online?

I still haven't quite managed to wrap my head around it from what
I've read.

   2. Gary suggested a window-based scoring gimmick, but I didn't
      implement it that way because it was too poor an approximation to
      "strongest over all possible tilings".  Instead it was done like:

       throw all the features into a bag
       while the bag isn't empty:
	   pick a feature F with maximal strength among all
	       features still in the bag (meaning a feature whose
	       spamprob is maximally distant from 0.5, in either
	       direction)
	   feed F's spamprob into scoring
	   remove every feature from the bag that intersects with
	       F in at least one position (in particular, that
	       removes F from the bag, and possibly other features too)
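That greedy bag-based selection can be sketched in a few lines of Python. This is only an illustration of the scheme described above, not the actual Spambayes code; the (spamprob, positions) feature records are hypothetical:

```python
# Sketch of the greedy "strongest first" tiling described above.
# Each feature is assumed to carry its spamprob and the set of
# window positions it covers; the representation is illustrative.

def greedy_tile(features):
    """features: list of (spamprob, positions) pairs.
    Returns the spamprobs actually fed into scoring."""
    bag = list(features)
    chosen = []
    while bag:
        # Strength = distance of the spamprob from the neutral 0.5.
        strongest = max(bag, key=lambda f: abs(f[0] - 0.5))
        chosen.append(strongest[0])
        # Drop every feature sharing at least one position with the
        # winner (this removes the winner itself as well).
        bag = [f for f in bag if not (f[1] & strongest[1])]
    return chosen
```

For example, with features (0.9, {0, 1}), (0.2, {1, 2}), and (0.6, {3}), the 0.9 feature wins first and knocks out the overlapping 0.2 feature, then 0.6 is taken, so only [0.9, 0.6] reach scoring.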

Ah, OK.  That's quite different.

I just took:

 - for each window position
    - put the maximal strength feature in as the local probability
      for the window position.
    - throw the rest of this window's features away.
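As a sketch (again illustrative, not the CRM114 source), that per-window variant amounts to:

```python
def window_local_probs(windows):
    """windows: one list of feature spamprobs per window position.
    Keep only the most extreme probability (farthest from 0.5) in
    each window; everything else in that window is thrown away."""
    return [max(probs, key=lambda p: abs(p - 0.5))
            for probs in windows if probs]
```

The contrast with the bag scheme is that here a single strong feature can win several overlapping windows, so its spamprob gets fed into scoring repeatedly instead of suppressing its neighbors once.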

   > What _has_ worked better is to use a Markov model instead of a
   > Bayesian model; that actually gets me down to 56.
   > I haven't tried tiling Markov yet... oh dear... another CPU-day
   > down the tubes.  :)

   Since Markov models come in 150 flavors of their own, I'll wait until you
   write a paper about it <wink>.

It's more like one or two slides.  :-(

Here's the quick description...

Right now, the way CRM114 calculates local probabilities for each
polynomial on each window is biased strongly toward 0.5.  The
previous formula that worked best was:
              hits_this_corpus - (hits_all_corpus - hits_this_corpus)
      0.5 + ---------------------------------------------------------
                        16 * ( hits_all_corpus + 1)

which, as you can see, needs a LOT of corpus hits on a feature to get much
certainty out of it.
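In code form that formula reads as follows (a direct transcription; the variable names follow the formula above, not CRM114's actual source):

```python
def local_prob(hits_this_corpus, hits_all_corpus):
    """CRM114-style local probability, biased strongly toward 0.5.
    hits_this_corpus: feature hits in the corpus being scored for;
    hits_all_corpus:  feature hits summed across all corpora."""
    hits_other = hits_all_corpus - hits_this_corpus
    return 0.5 + (hits_this_corpus - hits_other) / (
        16.0 * (hits_all_corpus + 1))
```

Plugging in the two cases below: local_prob(1, 1) gives 0.5 + 1/32 = 0.53125, and local_prob(16, 16) gives 0.5 + 16/272, about 0.5588.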

For a corpus hapax that recurs in the test message, the local probability
would be 0.5 + 1/32 = .53125

With 16 hits on it, all spam, the local probability would be 0.5 +
16/(16*17) = .5588

...um... that's not right.

and now that I'm staring at it... I'm wondering how this could
ever have worked, what I was thinking when I wrote it, and the
shocking realization that I need to spend a LOT more time 
checking my code...

	 -Bill Yerazunis
