Bland words, and z-combining (was RE: [Spambayes] Bland word only score..)

Tim Peters tim.one@comcast.net
Mon, 14 Oct 2002 23:54:51 -0400


FYI, I doubled the number of accurate digits in z-combining's probability ->
zscore calculations.  This made the scores even more extreme for me -- the
median ham score fell to 0 on the nose.  The good news is that my
lowest-scoring spam's score rose, from

    4.09672e-012

to

    4.10227e-012

Take *that* to the bank <wink>.
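
For anyone who wants to see why the accuracy of the probability -> zscore
step matters at the extremes, here is a minimal sketch.  It is not the
Spambayes code: it assumes z-combining means "turn each word probability into
a standard-normal z-score, sum, divide by sqrt(n), and turn the result back
into a probability", and it leans on the standard library's
statistics.NormalDist (Python 3.8+) for both conversions.  The name z_combine
and the clamping constant eps are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_combine(probs, eps=1e-15):
    """Combine per-word spam probabilities into one score in (0, 1).

    Assumed recipe (not taken from the Spambayes source): map each
    probability to a z-score, sum, scale by sqrt(n), map back.
    """
    if not probs:
        raise ValueError("need at least one probability")
    zs = []
    for p in probs:
        # inv_cdf() rejects exactly 0.0 and 1.0, so clamp into (0, 1).
        p = min(max(p, eps), 1.0 - eps)
        zs.append(_STD_NORMAL.inv_cdf(p))
    z = sum(zs) / sqrt(len(zs))
    return _STD_NORMAL.cdf(z)
```

Even a handful of mildly lopsided clues drives the combined score very close
to 0 or 1, which is where the last few digits of the conversions start to
decide what the final score looks like.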

-> best cost for all runs: $26.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 114 cutoff pairs
-> smallest ham & spam cutoffs 0.63 & 0.944
->     fp 2; fn 3; unsure ham 10; unsure spam 9
->     fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0559%
-> largest ham & spam cutoffs 0.868 & 0.946
->     fp 2; fn 5; unsure ham 2; unsure spam 7
->     fp rate 0.01%; fn rate 0.0357%; unsure rate 0.0265%

That's the first run of any kind I've seen where the minimum cost could be
achieved in more than one way.  I don't mean that there were 114 cutoff
pairs that achieved it (that's normal enough), but that the two specific
endpoints shown there make different tradeoffs between FN and unsures.
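
The arithmetic behind that is easy to check.  A small sketch, using the
per-fp, per-fn and per-unsure costs from the output above and working in
integer cents to dodge float rounding (the function name cost_cents is just
for illustration):

```python
# Costs from the run output, in cents: $10.00 per fp, $1.00 per fn,
# $0.20 per unsure.
FP_CENTS = 1000
FN_CENTS = 100
UNSURE_CENTS = 20


def cost_cents(fp, fn, unsure):
    """Total cost in cents for the given error and unsure counts."""
    return fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS


# Smallest cutoffs: fp 2, fn 3, unsure 10 ham + 9 spam.
smallest = cost_cents(fp=2, fn=3, unsure=10 + 9)
# Largest cutoffs: fp 2, fn 5, unsure 2 ham + 7 spam.
largest = cost_cents(fp=2, fn=5, unsure=2 + 7)
# Both come to 2680 cents, i.e. $26.80.
```

Two extra FN cost $2.00, and ten fewer unsures save $2.00, so the two
endpoints land on exactly the same total.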

What this doesn't show is that picking cutoffs of 0.05 and 0.95 would have
been almost as cheap -- getting *close* to the minimum isn't touchy at all,
but getting the absolute minimum is.
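
To try other cutoff pairs on your own scores, something like the following
sketch would do the tallying.  The boundary conventions (score < ham cutoff
is ham, score >= spam cutoff is spam, anything between is unsure) are an
assumption on my part, and the scores in the example are invented toy values,
not data from the run above.

```python
def tally(ham_scores, spam_scores, ham_cutoff, spam_cutoff):
    """Return (fp, fn, unsure) counts for one cutoff pair.

    Assumed conventions: score < ham_cutoff is called ham,
    score >= spam_cutoff is called spam, the rest is unsure.
    """
    def is_unsure(score):
        return ham_cutoff <= score < spam_cutoff

    fp = sum(1 for s in ham_scores if s >= spam_cutoff)
    fn = sum(1 for s in spam_scores if s < ham_cutoff)
    unsure = sum(1 for s in ham_scores if is_unsure(s))
    unsure += sum(1 for s in spam_scores if is_unsure(s))
    return fp, fn, unsure


# Invented toy scores, purely to exercise the function.
toy_ham = [0.0, 0.0, 0.01, 0.3, 0.97]
toy_spam = [1.0, 1.0, 0.99, 0.7, 0.02]

wide = tally(toy_ham, toy_spam, 0.05, 0.95)
narrow = tally(toy_ham, toy_spam, 0.63, 0.944)
```

Multiply the counts by the per-fp, per-fn and per-unsure costs to compare
cutoff pairs.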