Re: [Spambayes] fwd: robinson f(w) equation: X constant confusion
Tim Peters said:
Rob Hooft ran some downhill Simplex optimizations that also converged on X a bit over 0.5, and S substantially smaller than the 0.45 we use by default. On three different sets of test data, I measured "the average" spamprob to be a bit over 0.5 too (it ranged from 0.52 to 0.56).
Interesting!
A difference is that the test data I used had about the same number of ham as spam, while you've got a 1::2 ratio. Are you sure you weren't using 1000 spam vs 2000 ham? If you were, and "the true unknown word" spamprob were about 0.5, I'd expect you to measure one near 1/3, since there would be (to a 0th-order approximation <wink>) about twice as many ham-word spamprobs feeding into the computed average as spam-word spamprobs, and that would drag the average below 0.5 simply due to having more of one kind of word than the other.
Actually, I've just checked -- it's not 2k:1k, it's 2k:2k, so it should be even.
IOW, Gary's suggestion for guessing x appears to me to be sensitive to the ham::spam ratio, but the method used for guessing spamprobs tries (with mixed results) not to be sensitive to that ratio. Mismatching assumptions, then.
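For reference, the X and S under discussion are the constants in Gary Robinson's f(w) adjustment: the raw per-word spamprob p(w) is shrunk towards X (the assumed spamprob for a never-before-seen word) with strength S. A rough Python sketch (parameter names are illustrative, not copied from the spambayes source):

    def robinson_fw(spam_count, ham_count, nspam, nham, x=0.5, s=0.45):
        """Gary Robinson's f(w): pull the raw word spamprob p(w) towards x,
        the assumed prob for an unknown word, with strength s, so that
        rarely-seen words don't get extreme scores."""
        # fraction of spam / ham messages containing the word
        # (assumes the word was seen at least once, so the sum below is nonzero)
        spamratio = spam_count / nspam
        hamratio = ham_count / nham
        p = spamratio / (spamratio + hamratio)
        n = spam_count + ham_count   # amount of evidence for this word
        return (s * x + n * p) / (s + n)

With n = 0 this is just x, and as n grows it converges on the observed p(w), so the choice of X mostly matters for hapaxes and other rarely-seen words.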
Interesting, BTW. Do you guys use the estimated X instead of a constant? Sounds like it could vary greatly depending on corpus ratios...
The X=0.53 S=0.05 result is cute -- it roughly says "it's about 50-50, but don't pay much attention to it".
There's another "sweet spot" at X=.69 and S=.42, which mystifies me; I would have thought that would cause more FPs, which is worse for the cost measure (see below).
I'm not sure what your cost measure is; as we measure costs by default, an FP is charged 10, in which case the contour lines ranging from 80 to 90 are showing the difference between one FP more or less; this *can* make them supremely sensitive to just one or two oddball msgs.
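In concrete terms, that flat cost measure is just a weighted sum of the error counts, roughly the following (the unsure weight of 0.2 is an assumption about the spambayes default, not something stated above):

    def flat_cost(n_fp, n_fn, n_unsure,
                  fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
        """Total cost: each false positive charged 10 units, each false
        negative 1 unit, each message left in the 'unsure' band 0.2."""
        return n_fp * fp_weight + n_fn * fn_weight + n_unsure * unsure_weight

At 10 units per FP, moving a single ham across the spam cutoff shifts the total by 10, which is why contour plots of this measure can swing on just one or two oddball messages.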
The cost measure is a direct copy of the spambayes one, so they can be compared ;) (I also use TCR, the cost measure from Ion Androutsopoulos' papers, but being able to see "unsures" helps us pick a scheme that maps well onto SpamAssassin scores.)

BTW, an interesting factor is that those scores were measured using a high "min prob strength"; I used 0.27. I'm running more tests where this varies, and I think that'll be quite interesting too ;)

PS: while I'm here -- I'm also comparing chi2 with gary-combining. I'm finding chi2 produces quite a few more FPs in particular, right in the 0.00 spike. Do you guys see much of this? Or have I screwed up my code with all this constant-tweaking? ;)

--j.
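(For context: TCR is Androutsopoulos' total cost ratio, nspam / (lambda * nFP + nFN), where lambda is the FP:FN cost ratio; higher is better.)

The chi2 combining being compared against gary-combining works roughly as follows: treat the per-word spamprobs as p-values and run an inverse chi-square test twice, once on the p values and once on the (1-p) values, then centre the difference. A sketch (assumes the probs have already been clamped away from 0 and 1, as the classifier does):

    from math import exp, log

    def chi2q(x2, v):
        """Prob(chi-squared with v degrees of freedom >= x2); v must be even."""
        m = x2 / 2.0
        total = term = exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def chi2_combine(probs):
        """Combine per-word spamprobs chi-square style: S measures how
        un-uniform the (1 - p) values are (spam evidence), H does the same
        for the p values (ham evidence), and the score centres their
        difference on 0.5 so conflicting evidence lands in the middle."""
        n = len(probs)
        S = 1.0 - chi2q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
        H = 1.0 - chi2q(-2.0 * sum(log(p) for p in probs), 2 * n)
        return (S - H + 1.0) / 2.0

The centring on 0.5 is what produces the characteristic spikes near 0.00 and 1.00, with the conflicting-evidence cases left in the unsure middle.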