[Spambayes] optimal max_discriminators for chi2

Tim Peters tim.one@comcast.net
Sat Oct 19 06:20:24 2002


[Rob Hooft]
> I did a series of runs:
> =========================
> [Classifier]
> use_chi_squared_combining: True
> robinson_minimum_prob_strength = 0.0
> robinson_probability_s = 0.45
> max_discriminators = XXXXXX
> ...

> With XXXXXX between 15 and 300. Attached are plots of the 95th
> percentile ham, 5th percentile spam, and of the total cost vertical
> against max_discriminators horizontal. Please note again that my ham is
> much tighter than my spam: vertical scales are from 0 to 0.16 and from
> 89 to 100, respectively (Almost a factor of 100!). The cost plot shows
> "no trend at all", but the variation is not large.

Thanks, Rob!  Have you ever plotted the density of the number of "words" in
your msgs?  I did at one time but have forgotten the result; IIRC, a
surprisingly large percentage didn't *have* 150 distinct words (but then I'm
also using the default robinson_minimum_prob_strength, which renders a whole
bunch of bland words invisible).
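
For anyone who wants to repeat that tally, here is a minimal sketch. It assumes the messages are already in hand as plain strings, and it uses a bare whitespace split as a stand-in for the real tokenizer, so the absolute numbers will differ from what the classifier sees. The function names are made up for illustration and are not part of the project's code.

```python
from collections import Counter


def distinct_word_counts(messages):
    """Number of distinct whitespace-separated tokens in each message.

    Assumption: str.split() stands in for the real tokenizer.
    """
    return [len(set(msg.split())) for msg in messages]


def fraction_below(counts, limit=150):
    """Fraction of messages with fewer than `limit` distinct tokens."""
    if not counts:
        return 0.0
    return sum(1 for c in counts if c < limit) / len(counts)


def histogram(counts, bin_width=25):
    """Bucket the counts into bins of `bin_width`; returns {bin_start: n}."""
    bins = Counter((c // bin_width) * bin_width for c in counts)
    return dict(sorted(bins.items()))
```

If fraction_below(counts, 150) comes out large, then raising max_discriminators past 150 can't change the score for most messages, since they have no more words to offer.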

The cost plot is disturbing, suggesting we're looking at random effects more
than trends.  Perhaps "best cost" is just too fickle a measure here, and it
would be better to develop a measure of "average cost" across all cutoff
pairs within the specified base (ham_cutoff, spam_cutoff) pair.
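
One way to read "average cost" is sketched below. It is only a sketch under stated assumptions: scores run from 0.0 to 1.0, "within the base pair" is taken to mean every (ham_cutoff, spam_cutoff) pair on an evenly spaced grid between the two base cutoffs, and the three weights are illustrative placeholders. The names cost, average_cost and steps are hypothetical, not existing test-driver code.

```python
def cost(ham_scores, spam_scores, ham_cutoff, spam_cutoff,
         fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Total cost at one cutoff pair (weights are illustrative assumptions)."""
    # Ham scoring at or above spam_cutoff is a false positive.
    fp = sum(1 for s in ham_scores if s >= spam_cutoff)
    # Spam scoring below ham_cutoff is a false negative.
    fn = sum(1 for s in spam_scores if s < ham_cutoff)
    # Anything landing between the two cutoffs is "unsure".
    unsure = sum(1 for s in list(ham_scores) + list(spam_scores)
                 if ham_cutoff <= s < spam_cutoff)
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


def average_cost(ham_scores, spam_scores, base_ham, base_spam, steps=10):
    """Mean cost over all grid pairs (h, s) with base_ham <= h <= s <= base_spam."""
    width = base_spam - base_ham
    grid = [base_ham + width * i / steps for i in range(steps + 1)]
    costs = [cost(ham_scores, spam_scores, grid[i], grid[j])
             for i in range(len(grid))
             for j in range(i, len(grid))]
    return sum(costs) / len(costs)
```

Averaging this way should damp the luck of a single score falling just on one side of a cutoff, which is the sort of fickleness that "best cost" is prone to.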

> I'd almost conclude "anything goes", but based on the spam-5% value
> I'd like to stick with values over ~40.

This sounds sensible to me too, and my own data doesn't contradict it
<wink>.  I'll leave the default at 150 until there's a clear reason to
change it.