[Spambayes] Re: FYI: Java implementation

Tue Jan 21 12:03:20 EST 2003

[Michael Hudson, on the plots at
    http://spambayes.sourceforge.net/background.html
]

> I meant to say it when I first looked at that page, but seeing those
> plots nearly made my eyeballs fall out.  Why does anyone still use
> Graham-combining?

Perhaps because the "A Plan for Spam" paper kept on describing it, and people
who tried it found that their first stab worked better than anything else
they had tried.  It took much testing on large and varied data before its
problems became clear.  Paul Graham has since discovered some of these on
his own, as he started getting his own false positives:

    http://www.paulgraham.com/better.html

Graham-combining has the advantage of being rigorously correct, to the
extent that its assumptions hold (word independence, and prior spam
probability of 0.5).  I can't really say what chi-combining produces in the
end, other than that "it's a score".  It's certainly not the probability
that a msg is spam.  Graham-combining does compute a spam probability, which
would be correct if only the world were nothing like it is <wink -- I simply
mean that the assumptions under which the calculation would be correct don't
hold in the real world>.
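
For concreteness, here is a minimal Python sketch of Graham-combining as it
is usually stated: under word independence and a 0.5 prior, the combined
probability is prod(p) / (prod(p) + prod(1 - p)).  The function name
graham_combine is made up for illustration and this is not the spambayes
code; it also assumes every per-word probability lies strictly between 0
and 1, and does no clamping or underflow protection.

```python
from math import prod  # Python 3.8+


def graham_combine(probs):
    """Combine per-word spam probabilities, assuming word independence
    and a prior spam probability of 0.5 (hypothetical helper)."""
    spam = prod(probs)                  # evidence for spam
    ham = prod(1.0 - p for p in probs)  # evidence for ham
    return spam / (spam + ham)


# Opposing clues cancel pairwise: five 0.9s against four 0.1s leave one 0.9.
mixed = graham_combine([0.9] * 5 + [0.1] * 4)

# Ten mildly spammy clues that all agree push the result almost to 1.
agreeing = graham_combine([0.9] * 10)
```

The second case hints at what the plots show: a handful of clues pointing
the same way is enough to drive the result to the extremes.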

So it's explainable and works remarkably well out of the box.  Its problems
are more-or-less subtle, and people have little patience for subtleties.