[Spambayes] Use for gray area in scoring range

Sun, 22 Sep 2002 02:39:51 -0400

[Guido]
> ...
> I haven't found the right setting for me yet.  0.575 did better than
> 0.55 but still much worse on the fps than Graham.

You don't have to run more tests for this:  the effect of any particular
setting can be determined from looking at your score histograms (provided
you stick to settings at the bucket boundaries; you can set nbuckets in the
options to a larger number to get finer-grained histograms).
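
As a rough illustration of reading error counts off the histograms, here is a minimal sketch. It is not code from the project: the function names and the bucket counts are made up, and it assumes scores lie in [0, 1] and that both histograms use the same equal-width buckets.

```python
def error_counts(ham_hist, spam_hist, boundary):
    """Return (false positives, false negatives) for a cutoff placed at
    the left edge of bucket `boundary`.

    Buckets [0, boundary) are called ham; buckets [boundary, nbuckets)
    are called spam.  Both histograms are plain lists of counts.
    """
    if len(ham_hist) != len(spam_hist):
        raise ValueError("histograms must have the same number of buckets")
    if not 0 <= boundary <= len(ham_hist):
        raise ValueError("boundary out of range")
    fp = sum(ham_hist[boundary:])   # ham scoring at or above the cutoff
    fn = sum(spam_hist[:boundary])  # spam scoring below the cutoff
    return fp, fn


def sweep(ham_hist, spam_hist):
    """List (cutoff, fp, fn) for every bucket boundary."""
    nbuckets = len(ham_hist)
    return [(b / nbuckets,) + error_counts(ham_hist, spam_hist, b)
            for b in range(nbuckets + 1)]


# Hypothetical 5-bucket histograms, for illustration only.
ham_hist = [90, 5, 3, 1, 1]
spam_hist = [0, 1, 2, 7, 40]

for cutoff, fp, fn in sweep(ham_hist, spam_hist):
    print("cutoff %.2f: fp=%d fn=%d" % (cutoff, fp, fn))
```

With more buckets the same sweep gives finer-grained cutoffs; nothing needs to be rescored.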

Note too that setting the new robinson_minimum_prob_strength option has
given a strong reduction in f-n rate for the two people who have reported on
it.

>> It's great to have the knob, but it's sensitive, and so far we've no
>> idea how to choose it short of trial and error (it's easy to choose
>> if you've got the score histograms to stare at, but end users
>> won't).

> Plus, we don't know why it's not 50, right?

I covered that in a later message:  no, but the most likely reason seems
simply that the spam mean is farther from 50 than the ham mean.

> Might that have to do with the spam/ham ratio?  I've got 83 hams for
> each 32 spams.

Run an experiment!  At least timcv.py has cmdline options to allow running
on random subsets, and allows specifying different counts for the ham
subsets than the spam subsets.  This allows testing on any ratio you like.
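
If you'd rather build the subsets by hand, the idea is just independent random sampling from each corpus.  The sketch below is an assumption-laden illustration, not timcv.py's implementation: sample_with_ratio and the message lists are hypothetical stand-ins.

```python
import random


def sample_with_ratio(hams, spams, n_ham, n_spam, seed=0):
    """Draw n_ham hams and n_spam spams at random, without replacement.

    A fixed seed makes the draw repeatable, so two runs with different
    classifier settings see the same messages.
    """
    rng = random.Random(seed)
    return rng.sample(hams, n_ham), rng.sample(spams, n_spam)


# Stand-in corpora: integers play the role of messages here.
hams = list(range(1000))
spams = list(range(1000, 1500))

# Reproduce an 83:32 ham:spam ratio, then try an even one.
ham_sub, spam_sub = sample_with_ratio(hams, spams, 83, 32)
ham_even, spam_even = sample_with_ratio(hams, spams, 32, 32)
print(len(ham_sub), len(spam_sub), len(ham_even), len(spam_even))
```

Comparing the best cutoff across a few such ratios would show whether the ratio is what pushes it away from the midpoint.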

I expect that playing with robinson_probability_a will also affect this.
Setting that to 0 will make words with very low corpus counts act much more
like they did under the all-default scheme (i.e., it will give them extreme
probabilities, which can help if, e.g., you've only got one message from
your favorite porn vendor in your training set).
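
To see why, here is a small sketch of the adjustment.  It assumes the adjustment has the form Gary Robinson describes, (a*x + n*p) / (a + n), where p is the raw spam probability for the word, n is the number of training messages containing it, and x is the probability assumed for a word never seen before.  The function name and the numbers are illustrative only.

```python
def adjusted_prob(p, n, a, x=0.5):
    """Blend the raw probability p with the prior guess x.

    a is the weight given to the prior, n the number of training
    messages the word appeared in.  With a == 0 the prior is ignored
    and p comes back unchanged (n must then be nonzero).
    """
    if a + n == 0:
        raise ValueError("need a > 0 or n > 0")
    return (a * x + n * p) / (a + n)


# A word seen in exactly one training message, and that one was spam.
print(adjusted_prob(p=1.0, n=1, a=1.0))   # pulled halfway toward x
print(adjusted_prob(p=1.0, n=1, a=0.0))   # left at the extreme

# With lots of evidence the prior hardly matters either way.
print(adjusted_prob(p=1.0, n=100, a=1.0))
```

So a nonzero a mutes exactly the rare words, and a = 0 lets a single sighting count at full strength.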