[Spambayes] How low can you go?
Bill Yerazunis
wsy at merl.com
Sun Dec 14 10:19:25 EST 2003
From: "Tim Peters" <tim.one at comcast.net>
[Bill Yerazunis]
> I tried that too - for each window stepping, only the most extreme
> probability was used. Essentially this decorrelated the incoming
> stream so that Bayesian modeling was a little more accurate.
I think details matter a lot here, and I doubt they left your results
predictive of ours. Two things in particular:
1. We do Bayesian modeling of individual word spamprobs, but there's
nothing Bayesian about the way we combine spamprobs. As the graphs on
http://spambayes.sourceforge.net/background.html
show, we had relatively enormous "spectacular failure" rates when
using Bayesian combining in the very early days, but those dropped
so low after moving to chi-combining that there are no instances
of a spectacular failure at all on the third graph. By
"spectacular failure" I mean an extremely low-scoring spam or
extremely high-scoring ham. Graham's scheme produced tons of
these (compared to what eventually proved possible).
Do you have a good writeup of chi-combining online?
I still haven't quite managed to wrap my head around it from what
I've read.
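For reference, SpamBayes's chi-combining is based on Fisher's method for combining p-values: the word spamprobs are folded into two chi-squared statistics, one measuring "how unlikely is this hamminess by chance" and one for spamminess, and the two are averaged. A rough sketch in Python (my reconstruction of the idea, not SpamBayes's actual code):

```python
import math

def chi2q(x2, df):
    """Survival function of the chi-squared distribution for even df:
    probability that a chi-squared variate with df degrees of freedom
    exceeds x2, via the standard series expansion."""
    assert df % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spamprobs a la chi-combining (Fisher's method).
    S is near 1 when the words look spammy, H near 1 when they look
    hammy; the final score tends toward 1 for spam, 0 for ham, and
    sits near 0.5 when the evidence conflicts."""
    n = len(probs)
    S = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0
```

The nice property, and (as I understand it) the reason the "spectacular failure" rate dropped: a message with strong evidence in both directions lands near 0.5 instead of being forced to an extreme.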
2. Gary suggested a window-based scoring gimmick, but I didn't
implement it that way because it was too poor an approximation to
"strongest over all possible tilings". Instead it was done like:
    throw all the features into a bag
    while the bag isn't empty:
        pick a feature F with maximal strength among all
            features still in the bag (meaning a feature whose
            spamprob is maximally distant from 0.5, in either
            direction)
        feed F's spamprob into scoring
        remove every feature from the bag that intersects with
            F in at least one position (in particular, that
            removes F from the bag, and possibly other features too)
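That bag procedure can be sketched as runnable Python (the Feature type with a token-position span is an illustrative assumption, not the real data structure):

```python
from typing import NamedTuple

class Feature(NamedTuple):
    start: int       # first token position the feature covers
    end: int         # one past the last position it covers
    spamprob: float  # learned per-feature spam probability

def strongest_disjoint(features):
    """Greedy approximation of 'strongest over all possible tilings':
    repeatedly take the remaining feature whose spamprob is farthest
    from 0.5, then discard every feature overlapping it in position."""
    bag = list(features)
    chosen = []
    while bag:
        f = max(bag, key=lambda g: abs(g.spamprob - 0.5))
        chosen.append(f)
        # keep only features entirely left or right of f's span
        bag = [g for g in bag if g.end <= f.start or g.start >= f.end]
    return [f.spamprob for f in chosen]
```

For example, with overlapping features at positions 0-2 (spamprob 0.9), 1-3 (0.01), and 3-4 (0.6), the strongest feature (0.01) wins its overlap and knocks out the 0.9 feature, leaving [0.01, 0.6] to feed into scoring.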
Ah, OK. That's quite different.
I just took:
  - for each window position:
      - put the maximal-strength feature in as the local probability
        for the window position.
      - throw the rest of this window's features away.
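In other words, one probability survives per window rather than per tiling; a minimal sketch of that variant (the list-of-lists input shape is my assumption):

```python
def per_window_max(window_probs):
    """For each window position, keep only the single spamprob that is
    farthest from 0.5 among that window's features; discard the rest.
    window_probs is a list with one inner list of feature spamprobs
    per window position."""
    return [max(probs, key=lambda p: abs(p - 0.5))
            for probs in window_probs if probs]
```

Adjacent windows share most of their tokens, so unlike the bag scheme above, the same strong feature can still dominate several consecutive windows.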
> What _has_ worked better is to use a Markov model instead of a
> Bayesian model; that actually gets me down to 56.
>
> I haven't tried tiling Markov yet... oh dear... another CPU-day
> down the tubes. :)
Since Markov models come in 150 flavors of their own, I'll wait until you
write a paper about it <wink>.
It's more like one or two slides. :-(
Here's the quick description...
Right now, the way CRM114 calculates local probabilities for each
polynomial on each window is biased strongly toward 0.5. The
previous formula that worked best was:
            hits_this_corpus - (hits_all_corpus - hits_this_corpus)
    0.5 + ---------------------------------------------------------
                      16 * (hits_all_corpus + 1)
which, as you can see, needs a LOT of corpus hits on a feature to get much
certainty out of it.
For a corpus hapax that recurs in the test message, the local probability
would be 0.5 + 1/32 = .53125
With 16 hits on it, all spam, the local probability would be 0.5 +
16/(16*17) = .5588
...um... that's not right.
and now that I'm staring at it... I'm wondering how this could
ever have worked, what I was thinking when I wrote it, and the
shocking realization that I need to spend a LOT more time
checking my code...
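Checking the arithmetic above directly against the quoted formula (a sketch, not CRM114's actual code):

```python
def local_prob(hits_this_corpus, hits_all_corpus):
    """The old CRM114 per-feature local probability, as quoted above:
    biased strongly toward 0.5 unless a feature has many corpus hits."""
    return 0.5 + (hits_this_corpus - (hits_all_corpus - hits_this_corpus)) \
               / (16.0 * (hits_all_corpus + 1))

hapax = local_prob(1, 1)      # 0.5 + 1/32        = 0.53125
spammy = local_prob(16, 16)   # 0.5 + 16/(16*17) ~= 0.5588
```

Both numbers do follow from the formula as written, which suggests the surprise lies in the formula's heavy bias toward 0.5 rather than in the arithmetic.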
-Bill Yerazunis