[Spambayes] Training without ham

Tim Peters tim.one@comcast.net
Mon Oct 28 21:33:59 2002


[T. Alexander Popiel]
> Summary: Ham is required in the training set, as expected.
> ...
> So yes, spambayes is worthless without ham in the training corpus.

Ya, but that doesn't prove we need to train on spam <wink>.

I posted a variant update_probabilities yesterday, which ignored hamcounts
when computing spamprobs.  What I didn't report was that I also tried it
after fiddling a combining method to merely compute the average spamprob in
a msg.
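
For readers who want to see what "average spamprob in a msg" amounts to, here
is a minimal sketch. It is not the code from the experiment: the function name,
the neutral 0.5 for unseen tokens, the choice to count each distinct token
once, and the example probabilities are all assumptions made for illustration.

```python
from statistics import mean

# Assumption: tokens never seen in training get a neutral score.
UNKNOWN_SPAMPROB = 0.5


def average_spamprob(tokens, spamprobs):
    """Return the mean spamprob over the distinct tokens in a message.

    `spamprobs` maps token -> spamprob in [0.0, 1.0]. Counting each
    distinct token once is an assumption of this sketch.
    """
    distinct = set(tokens)
    if not distinct:
        return UNKNOWN_SPAMPROB
    return mean(spamprobs.get(tok, UNKNOWN_SPAMPROB) for tok in distinct)


# Hypothetical per-token spamprobs, made up for illustration only.
example_probs = {"viagra": 0.99, "python": 0.01, "free": 0.75}
score = average_spamprob(["free", "python", "free", "unseen"], example_probs)
```

Because every token gets equal weight, a handful of strong clues is easily
diluted by many middling ones, which is one plausible reason such an average
separates the two classes poorly.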

Histogram analysis consistently suggested that my best strategy was to set
ham_cutoff at 0.0 then, and spam_cutoff at 1.0; i.e., to call *everything*
"unsure".
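
To make the effect of those cutoffs concrete, here is a small sketch of a
three-way decision rule. The function name and the strict comparisons at the
boundaries are assumptions of this sketch, not a quote of the project's code.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way decision on a score in [0.0, 1.0].

    Assumption: strictly below ham_cutoff is ham, strictly above
    spam_cutoff is spam, and everything in between is unsure.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# With ham_cutoff=0.0 and spam_cutoff=1.0, no score in [0.0, 1.0] can
# fall strictly outside the unsure band.
labels = [classify(s, 0.0, 1.0) for s in (0.0, 0.2, 0.5, 1.0)]
```

Under these assumptions every entry in `labels` is "unsure", which is the
degenerate outcome the histogram analysis was pointing at.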

The (possibly surprising) reason can be deduced from this (from a 10-fold
randomized cross-validation run over 2000 ham and 2000 spam):

-> <stat> Spam scores for all runs: 2000 items; mean 19.54; sdev 8.85
-> <stat> min 3.97721; median 19.6394; max 71.6909
-> <stat> percentiles: 5% 4.65238; 25% 14.8778; 75% 23.7485; 95% 34.8339

-> <stat> Ham scores for all runs: 2000 items; mean 24.17; sdev 7.97
-> <stat> min 4.2792; median 23.4837; max 73.9471
-> <stat> percentiles: 5% 12.0403; 25% 19.4017; 75% 28.2031; 95% 37.6717

IOW, ham scores *higher* than spam for spamness under this measure, although
the overlap is extreme.  I wasn't much motivated to pursue this <wink>.
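
The overlap can be checked directly from the summary figures quoted above.
This sketch only does arithmetic on the reported means, sdevs and quartiles;
the variable names are made up here, and the scores are taken as printed
(apparently on a 0-100 scale).

```python
# Figures copied from the <stat> lines above.
spam = {"mean": 19.54, "sdev": 8.85, "q25": 14.8778, "q75": 23.7485}
ham = {"mean": 24.17, "sdev": 7.97, "q25": 19.4017, "q75": 28.2031}

# How much higher ham scores than spam, on average.
mean_gap = ham["mean"] - spam["mean"]

# Width of the region where the two interquartile ranges overlap.
overlap_lo = max(spam["q25"], ham["q25"])
overlap_hi = min(spam["q75"], ham["q75"])
iqr_overlap = max(0.0, overlap_hi - overlap_lo)

# The gap between the means is smaller than either standard deviation.
gap_is_within_one_sdev = mean_gap < min(spam["sdev"], ham["sdev"])
```

The middle halves of the two distributions share a sizable stretch, and the
means differ by less than one sdev of either class, so no pair of cutoffs can
pull ham and spam apart on this score.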