[Spambayes] FYI: Java implementation
tim.one at comcast.net
Mon Jan 20 23:50:41 EST 2003
> I have a very large training corpus, so I'm seeing well-
> separated distributions of good versus spam probs, with a
> sprinkling of "unsures" scattered through the middle. An
> uncertain cutoff at 3 sigma from the means should work, but this
> notion needs some testing. That chi2 test is definitely on the
> drawing boards, even if only for comparison purposes...
Anthony Baxter has some plots of score distributions for Graham-combining,
Gary-combining and chi-combining here:
It's the sharpness and spread of the separation in chi- that's attractive.
Our experiments (most of mine were on a 34,000-msg database) showed that you
could usually pick equally good cutoffs under Gary-combining, but that it
took 3 decimal digits of precision to do so, that the best cutoffs kept
shifting over time (== amount of training data) and across test sets, and
that it wasn't possible to guess good values in advance. In contrast, canned chi- cutoff
values with 1 decimal digit of precision worked well for just about
everyone. The primary size-related (# of training msgs) effect I noticed is
that the chi- unsure range could be profitably shrunk as more msgs were
trained on, but even if you didn't bother, your original cutoffs continued
to work well (the *optimal* cutoffs shifted under chi- too, as with
Gary-combining, but chi- degraded more gently if you left them alone).
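
For anyone porting this, here is a minimal Python sketch of chi-combining
with a pair of canned one-digit cutoffs. It is an illustration under stated
assumptions, not a copy of the project's code: the names chi2q, chi_combine
and classify are hypothetical, and the 0.2 / 0.9 cutoffs are illustrative
values that do not come from this message. The per-token spam probabilities
are assumed to have been computed already.

```python
import math


def chi2q(x2, v):
    """Chance that a chi-squared variable with v degrees of freedom
    exceeds x2. Only valid for even v (closed-form series)."""
    if v % 2 != 0:
        raise ValueError("v must be even")
    m = x2 / 2.0
    term = math.exp(-m)          # may underflow to 0.0 for huge m; fine here
    total = term
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def chi_combine(probs, eps=1e-12):
    """Combine per-token spam probabilities into one score in [0, 1].
    Scores near 0 look like ham, near 1 like spam, near 0.5 are unsure."""
    n = len(probs)
    if n == 0:
        return 0.5               # no evidence either way
    # Clamp so that log() never sees 0 (eps is an assumption of this sketch).
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    ln_spam = sum(math.log(p) for p in clamped)
    ln_ham = sum(math.log(1.0 - p) for p in clamped)
    # Fisher-style combination, once from each side.
    s = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)    # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)   # evidence for ham
    return (s - h + 1.0) / 2.0


def classify(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Three-way decision using coarse cutoffs (illustrative values)."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"
```

Strong, consistent clues push the score hard toward 0 or 1, while
conflicting clues cancel and land near 0.5. That is the behavior that lets
coarse cutoffs work, and shrinking the unsure range then just means moving
the two cutoffs closer together.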