[Spambayes] Proposing to remove 4 combining schemes

Tim Peters tim.one@comcast.net
Sat Oct 19 07:11:49 2002


[T. Alexander Popiel]
> ...
> The score we've got is just a number in the range 0 to 1 which has
> interesting discriminatory properties.  It's not linear with any
> concept of surety, and it's not linear with similarity to spam or
> ham, either.

Ah, but it is linear with 1 minus the probability that -2 times the natural
log of the geometric mean of 1-p_i for a vector of random probabilities p
would exceed 1 minus -2 times the natural log of the geometric mean of 1-p_i
for the estimated spamprobs in the message, minus 1 minus the probability
that -2 times the natural log of the geometric mean of p_i for a vector of
random probabilities p would exceed 1 minus -2 times the natural log of the
geometric mean of p_i for the estimated spamprobs in the message.

> People not immersed in how it's generated and/or buried in test results
> over decent sized corpora are sure (there's that troubling word again)
> to misinterpret it.
> </rant>

Even given a clear explanation like the above?  I vote we put that in the
user docs, and strongly imply that anyone to whom that isn't obvious from
mere inspection is an idiot who deserves all the spam they get <wink>.

[Rob]
>> But Sean's "sort on score" idea is also very useful. I think it'd speed
>> up the manual scanning/deletion process.

[Alex]
> Having looked at the results from the show_unsure config option,
> I tend to disagree... position in the list doesn't seem to have
> any correlation with spam vs. ham.

Are you sure?  I've got a GUI that sorts email by "Hammie score" now, and
there's a clear correlation by eyeball *adjacent to the endpoints* of the
unsure range.  The middle of the unsure range is a jumble, though, and
predictably so, since long messages suffering cancellation disease in
particular score very close to 0.5 under chi-combining.
Where Graham-combining would score them at 0.0 or 1.0 depending on which
flavor of clue just happened to appear more often, chi scores them more like
0.49999 or 0.50001.  It's still a coin toss, but of an exceedingly tiny coin
<wink>.
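
The contrast between the two combining schemes can be sketched in a few
lines of Python.  This is only an illustration under stated assumptions,
not the project's actual code: the function names chi2q, chi_combine and
graham_combine are made up here, the chi statistic is taken as -2 times the
sum of the logs (Fisher's method), the Graham rule is simplified to use
every clue rather than only the most extreme ones, and the clue values
(0.99 and 0.01) are invented to mimic a long message with many strong clues
of both flavors.

```python
from math import exp, log

def chi2q(x2, dof):
    """Probability that a chi-squared variate with an even number of
    degrees of freedom exceeds x2 (finite series valid for even dof)."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Guard against rounding pushing the sum slightly above 1.
    return min(total, 1.0)

def chi_combine(probs):
    """Hypothetical chi-combining: (S - H + 1) / 2, where S and H are the
    spam and ham indicators built from the two chi-squared tail tests."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0

def graham_combine(probs):
    """Simplified Graham-style combining over all clues (assumption: no
    limit on how many clues are used)."""
    spam = ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)

# Invented "cancellation disease" message: 30 strong spam clues and
# 25 strong ham clues.
clues = [0.99] * 30 + [0.01] * 25
print("graham:", graham_combine(clues))
print("chi:   ", chi_combine(clues))
```

With these made-up clues the Graham-style score lands essentially at 1.0
because spam clues are slightly more numerous, while the chi score stays
within a hair of 0.5, since both S and H are pinned near 1 and nearly
cancel.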