[Spambayes] Moving closer to Gary's ideal
Tim Peters
tim.one@comcast.net
Mon, 23 Sep 2002 13:24:30 -0400
[Sjoerd Mullender]
> > """
> > [Classifier]
> > use_robinson_probability: True
> > use_robinson_combining: True
> > max_discriminators: 1500
> >
> > [TestDriver]
> > spam_cutoff: 0.50
> > """
>
> I tested this against the default options (except I have
> count_all_header_lines: True and mine_received_headers: True
> permanently)
I'm jealous! If you can use those, you can probably enable Jeremy's
basic_header_tokenize too.
> and got these results:
>
> false positive percentages
> 0.524 1.047 lost +99.81%
> 0.000 0.524 lost +(was 0)
> 0.524 0.524 tied
> 0.524 1.047 lost +99.81%
> 0.524 1.571 lost +199.81%
>
> won 0 times
> tied 1 times
> lost 4 times
>
> total unique fp went from 4 to 9 lost +125.00%
> mean fp % went from 0.418848167539 to 0.942408376964 lost +125.00%
>
> false negative percentages
> 1.571 0.000 won -100.00%
> 2.618 2.094 won -20.02%
> 1.571 0.524 won -66.65%
> 0.524 0.524 tied
> 1.571 1.047 won -33.35%
>
> won 4 times
> tied 1 times
> lost 0 times
>
> total unique fn went from 15 to 8 won -46.67%
> mean fn % went from 1.57068062827 to 0.83769633508 won -46.67%
>
> The histograms in the default scheme show the usual pattern, but the
> histograms with the changed parameters is like this:
Note that most people reported needing to boost spam_cutoff above 0.5 for
best results in this scheme. For example, your histograms show that
boosting it to 0.525 would cut 8 fp and add 10 fn, leaving a grand total of
9-8 = 1 fp and 8+10 = 18 fn. Boosting it to 0.55 would eliminate all your
fp, but add yet another 15 fn. Setting nbuckets higher than the default 40
would allow prediction of finer-grained changes.
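
The cutoff arithmetic above can be sketched in a few lines of Python. This
is only an illustration, not the test driver's code: count_errors is a
hypothetical helper, the scores are made up, and it assumes a message is
called spam when its score is at or above the cutoff.

```python
def count_errors(ham_scores, spam_scores, cutoff):
    """Return (false positives, false negatives) at the given cutoff.

    Assumption: a message is classified as spam when score >= cutoff.
    """
    fp = sum(1 for s in ham_scores if s >= cutoff)   # ham called spam
    fn = sum(1 for s in spam_scores if s < cutoff)   # spam called ham
    return fp, fn


# Made-up scores for illustration only; these are not Sjoerd's data.
ham_scores = [0.31, 0.42, 0.47, 0.51, 0.52, 0.53]
spam_scores = [0.49, 0.51, 0.54, 0.58, 0.66, 0.71]

for cutoff in (0.50, 0.525, 0.55):
    fp, fn = count_errors(ham_scores, spam_scores, cutoff)
    print("spam_cutoff %.3f: %d fp, %d fn" % (cutoff, fp, fn))
```

Raising the cutoff can only trade fp for fn, never reduce both, which is
why the choice depends on how much each kind of error costs you.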
Passive testing is helpful, but you have to *play* some if you want to help
tune a new approach. For example, play with spam_cutoff, try
robinson_minimum_prob_strength: 0.1
or even higher values, and try different values for
robinson_probability_a.
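
One way to organize that kind of play is to generate a grid of option
files and run each one through the test driver. The sketch below is a
guess at how that might look, not an existing tool: make_config is a
hypothetical helper, the candidate values (especially those for
robinson_probability_a) are arbitrary, and putting both robinson_* options
under [Classifier] is an assumption.

```python
import itertools


def make_config(cutoff, min_strength, prob_a):
    """Build the text of one option file for a single parameter combination.

    Assumption: both robinson_* options belong in the [Classifier] section.
    """
    return (
        "[Classifier]\n"
        "use_robinson_probability: True\n"
        "use_robinson_combining: True\n"
        "max_discriminators: 1500\n"
        f"robinson_minimum_prob_strength: {min_strength}\n"
        f"robinson_probability_a: {prob_a}\n"
        "\n"
        "[TestDriver]\n"
        f"spam_cutoff: {cutoff}\n"
    )


# Candidate values are arbitrary picks for illustration, not recommendations.
cutoffs = [0.50, 0.525, 0.55]
strengths = [0.0, 0.1, 0.2]
prob_as = [0.5, 1.0, 2.0]

configs = [
    make_config(c, s, a)
    for c, s, a in itertools.product(cutoffs, strengths, prob_as)
]

print(len(configs), "combinations to try; the first one is:")
print(configs[0])
```

Each generated text can be saved as an option file for one test run, and
the resulting fp and fn summaries compared across runs.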