[Spambayes] How low can you go?
wsy at merl.com
Sun Dec 14 10:19:25 EST 2003
From: "Tim Peters" <tim.one at comcast.net>
> I tried that too - for each window stepping, only the most extreme
> probability was used. Essentially this decorrelated the incoming
> stream so that Bayesian modeling was a little more accurate.
I think details matter a lot here, and I doubt they left your results
predictive of ours. Two things in particular:
1. We do Bayesian modeling of individual word spamprobs, but there's
nothing Bayesian about the way we combine spamprobs. As the graphs
show, we had relatively enormous "spectacular failure" rates when
using Bayesian combining in the very early days, but those dropped
so low after moving to chi-combining that there are no instances
of a spectacular failure at all on the third graph. By
"spectacular failure" I mean an extremely low-scoring spam or
extremely high-scoring ham. Graham's scheme produced tons of
these (compared to what eventually proved possible).
Do you have a good writeup of chi-combining online?
I still haven't quite managed to wrap my head around it from what
I've read so far.
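For the record, here's a rough sketch of what I understand chi-combining to be, reconstructed from my reading of the SpamBayes classifier source, so treat the details as approximate rather than authoritative. Each word's spamprob is treated as a p-value and combined via Fisher's method twice, once for spam evidence and once for ham evidence:

```python
from math import exp, log

def chi2Q(x2, v):
    # Survival function P(chi-squared with v df > x2), for even v only,
    # computed via the Poisson-sum identity.
    assert v % 2 == 0
    m = x2 / 2.0
    term = exp(-m)
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    # S: evidence that the message is spam; H: evidence that it is ham.
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(log(p) for p in probs), 2 * n)
    # Final score in [0, 1]; conflicting evidence cancels toward 0.5
    # ("unsure") instead of producing an extreme, confidently wrong score.
    return (S - H + 1.0) / 2.0
```

The middle ground is the point: a message with strong evidence in both directions lands near 0.5 rather than at either extreme, which is what cuts down the spectacular failures.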
2. Gary suggested a window-based scoring gimmick, but I didn't
implement it that way because it was too poor an approximation to
"strongest over all possible tilings". Instead it was done like:
    throw all the features into a bag
    while the bag isn't empty:
        pick a feature F with maximal strength among all
        features still in the bag (meaning a feature whose
        spamprob is maximally distant from 0.5, in either
        direction)
        feed F's spamprob into scoring
        remove every feature from the bag that intersects with
        F in at least one position (in particular, that
        removes F from the bag, and possibly other features too)
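In Python, that greedy bag scheme might look like this; the feature representation (a spamprob paired with a set of token positions) is my own invention for illustration, not actual SpamBayes code:

```python
def strength(spamprob):
    # Strength = distance from the neutral probability 0.5, in either direction.
    return abs(spamprob - 0.5)

def select_features(features):
    """features: list of (spamprob, set_of_token_positions) tuples.
    Greedily feed the strongest non-overlapping features into scoring."""
    bag = list(features)
    selected = []
    while bag:
        # Pick a feature F with maximal strength among those still in the bag.
        f = max(bag, key=lambda feat: strength(feat[0]))
        selected.append(f[0])  # feed F's spamprob into scoring
        # Remove every feature that intersects F in at least one position
        # (this removes F itself, and possibly other features too).
        bag = [g for g in bag if not (g[1] & f[1])]
    return selected
```

For example, with features at positions {0,1}, {1,2}, and {3}, picking the strongest first knocks out anything that shares a position with it, so overlapping weaker features never get scored.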
Ah, OK. That's quite different.
I just took:
- for each window position:
  - put the maximal-strength feature in as the local probability
    for the window position.
  - throw the rest of this window's features away.
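The per-window scheme above boils down to something like this (a hypothetical sketch, not the actual CRM114 code):

```python
def window_local_probs(windows):
    """windows: list of lists of spamprobs, one list per window position.
    Keep only the maximal-strength feature per window; discard the rest."""
    return [max(feats, key=lambda p: abs(p - 0.5))
            for feats in windows if feats]
```

So each window contributes exactly one local probability, and the rest of that window's features never influence the score at all, which is a much cruder approximation than the bag scheme.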
> What _has_ worked better is to use a Markov model instead of a
> Bayesian model; that actually gets me down to 56.
> I haven't tried tiling Markov yet... oh dear... another CPU-day
> down the tubes. :)
Since Markov models come in 150 flavors of their own, I'll wait until you
write a paper about it <wink>.
It's more like one or two slides. :-(
Here's the quick description...
Right now, the way CRM114 calculates local probabilities for each
polynomial on each window is biased strongly toward 0.5. The
previous formula that worked best was:
          hits_this_corpus - (hits_all_corpus - hits_this_corpus)
    0.5 + --------------------------------------------------------
                      16 * (hits_all_corpus + 1)
which, as you can see, needs a LOT of corpus hits on a feature to get
much certainty out of it.
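In code, the formula is just the following (the names mirror the fraction above; this is a sketch, not the actual CRM114 source):

```python
def local_prob(hits_this_corpus, hits_all_corpus):
    # Local probability for one feature, strongly biased toward 0.5:
    # the numerator is (hits in this corpus) minus (hits everywhere else),
    # and the denominator damps it by 16 * (total hits + 1).
    numer = hits_this_corpus - (hits_all_corpus - hits_this_corpus)
    denom = 16.0 * (hits_all_corpus + 1)
    return 0.5 + numer / denom

local_prob(1, 1)    # a corpus hapax: 0.5 + 1/32 = 0.53125
local_prob(16, 16)  # 16 hits, all one corpus: 0.5 + 16/(16*17), about 0.5588
```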
For a corpus hapax that recurs in the test message, the local probability
would be 0.5 + 1/32 = .53125
With 16 hits on it, all spam, the local probability would be 0.5 +
16/(16*17) = .5588
...um... that's not right.
and now that I'm staring at it... I'm wondering how this could
ever have worked, what I was thinking when I wrote it, and the
shocking realization that I need to spend a LOT more time
checking my code...