[Spambayes] Moving closer to Gary's ideal

Tim Peters tim.one@comcast.net
Mon, 23 Sep 2002 13:24:30 -0400


[Sjoerd Mullender]
> > """
> > [Classifier]
> > use_robinson_probability: True
> > use_robinson_combining: True
> > max_discriminators: 1500
> >
> > [TestDriver]
> > spam_cutoff: 0.50
> > """
>
> I tested this against the default options (except I have
> count_all_header_lines: True and mine_received_headers: True
> permanently)

I'm jealous!  If you can use those, you can probably enable Jeremy's
basic_header_tokenize too.

> and got these results:
>
> false positive percentages
>     0.524  1.047  lost   +99.81%
>     0.000  0.524  lost  +(was 0)
>     0.524  0.524  tied
>     0.524  1.047  lost   +99.81%
>     0.524  1.571  lost  +199.81%
>
> won   0 times
> tied  1 times
> lost  4 times
>
> total unique fp went from 4 to 9 lost  +125.00%
> mean fp % went from 0.418848167539 to 0.942408376964 lost  +125.00%
>
> false negative percentages
>     1.571  0.000  won   -100.00%
>     2.618  2.094  won    -20.02%
>     1.571  0.524  won    -66.65%
>     0.524  0.524  tied
>     1.571  1.047  won    -33.35%
>
> won   4 times
> tied  1 times
> lost  0 times
>
> total unique fn went from 15 to 8 won    -46.67%
> mean fn % went from 1.57068062827 to 0.83769633508 won    -46.67%
>
> The histograms in the default scheme show the usual pattern, but the
> histograms with the changed parameters is like this:

Note that most people reported needing to boost spam_cutoff above 0.5 for
best results in this scheme.  For example, your histograms show that
boosting it to 0.525 would cut 8 fp and add 10 fn, leaving a grand total of
9-8 = 1 fp and 8+10 = 18 fn.  Boosting it to 0.55 would eliminate all your
fp, but add yet another 15 fn.  Setting nbuckets higher than the default 40
would allow prediction of finer-grained changes.
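
The arithmetic above can be sketched in Python.  This is only an
illustration, not the test driver's code: predict_errors is a hypothetical
helper, and it assumes a score exactly at the cutoff counts as spam.  The
counts in buckets 20 and 21 are the ones implied by the numbers above; the
position of the 8 existing fn (bucket 19) is a placeholder, and every bucket
that can't affect the result is left at zero.

```python
def predict_errors(ham_hist, spam_hist, cutoff):
    """Predict (fp, fn) for a cutoff from equal-width score histograms.

    Bucket i covers scores in [i/nbuckets, (i+1)/nbuckets).  Only cutoffs
    that fall on a bucket boundary can be predicted from the counts.
    """
    nbuckets = len(ham_hist)
    if len(spam_hist) != nbuckets:
        raise ValueError("histograms must have the same number of buckets")
    first_spam_bucket = round(cutoff * nbuckets)
    if abs(first_spam_bucket / nbuckets - cutoff) > 1e-9:
        raise ValueError("cutoff must fall on a bucket boundary")
    fp = sum(ham_hist[first_spam_bucket:])   # ham scoring at or above cutoff
    fn = sum(spam_hist[:first_spam_bucket])  # spam scoring below cutoff
    return fp, fn


NBUCKETS = 40  # the default mentioned above; each bucket is 0.025 wide

ham_hist = [0] * NBUCKETS
ham_hist[20] = 8    # ham in [0.500, 0.525): the 8 fp cut by cutoff 0.525
ham_hist[21] = 1    # ham in [0.525, 0.550): the last fp, cut by 0.55

spam_hist = [0] * NBUCKETS
spam_hist[19] = 8   # PLACEHOLDER position for the 8 fn already below 0.5
spam_hist[20] = 10  # spam in [0.500, 0.525): fn added by cutoff 0.525
spam_hist[21] = 15  # spam in [0.525, 0.550): fn added by cutoff 0.55

for cutoff in (0.5, 0.525, 0.55):
    print(cutoff, predict_errors(ham_hist, spam_hist, cutoff))
```

A cutoff such as 0.51 is rejected here because it falls inside a bucket;
that is the sense in which more buckets allow finer-grained predictions.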

Passive testing is helpful, but you have to *play* some if you want to help
tune a new approach.  Like playing with spam_cutoff, trying

robinson_minimum_prob_strength: 0.1

or even higher values, and trying different values for
robinson_probability_a.
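
One way to organize that kind of play is to generate an option file per
combination of settings and run the tests once for each.  A minimal sketch
follows; candidate_configs is a hypothetical helper, not part of the project.
Putting the two robinson_* options under [Classifier] is an assumption, and
the robinson_probability_a values are placeholders to replace with whatever
you want to try.

```python
from itertools import product


def candidate_configs(cutoffs, strengths, a_values):
    """Yield one ini-style option text per parameter combination."""
    for cutoff, strength, a in product(cutoffs, strengths, a_values):
        yield (
            "[Classifier]\n"
            "use_robinson_probability: True\n"
            "use_robinson_combining: True\n"
            "max_discriminators: 1500\n"
            # Assumption: these two options live in [Classifier].
            f"robinson_minimum_prob_strength: {strength}\n"
            f"robinson_probability_a: {a}\n"
            "\n"
            "[TestDriver]\n"
            f"spam_cutoff: {cutoff}\n"
        )


cutoffs = (0.5, 0.525, 0.55)   # values discussed in this message
strengths = (0.1, 0.2)         # 0.1 "or even higher"
a_values = (0.5, 1.0)          # PLACEHOLDER values, not recommendations

configs = list(candidate_configs(cutoffs, strengths, a_values))
print(len(configs), "candidate option files")
print(configs[0])
```

Each generated text can be saved as an option file and its results compared
against the previous run in the usual way.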