[Spambayes] options.skip_max_word_size.

Mon Oct 28 17:01:15 2002

Here are results for skip_max_word_size on my c.l.py test, 10-fold CV, with
ham_cutoff=0.20 and spam_cutoff=0.80:

-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto]

filename:    max12   max20
ham:spam:  20000:14000
                   20000:14000
fp total:        2       2       the same
fp %:         0.01    0.01
fn total:        0       0       the same
fn %:         0.00    0.00
unsure t:      103     100       slight decrease
unsure %:     0.30    0.29
real cost:  $40.60  $40.00       slight improvement with these cutoffs
best cost:  $27.00  $27.40       best possible got slightly worse
h mean:       0.28    0.27
h sdev:       2.99    2.92
s mean:      99.94   99.93
s sdev:       1.41    1.47
mean diff:   99.66   99.66
k:           22.65   22.70

"Best possible" in max20 would have been to boost ham_cutoff to 0.50(!), and
drop spam_cutoff a little to 0.78.  This would have traded away most of the
unsures in return for letting 3 spam through:

-> smallest ham & spam cutoffs 0.5 & 0.78
->     fp 2; fn 3; unsure ham 11; unsure spam 11
->     fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0647%

Best possible in max12 was much the same:

-> largest ham & spam cutoffs 0.5 & 0.78
->     fp 2; fn 3; unsure ham 12; unsure spam 8
->     fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%
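
The cost and rate figures above can be checked with a few lines of Python.
This is only a sketch: it assumes cost weights of $10.00 per false positive,
$1.00 per false negative and $0.20 per unsure.  Those weights are not stated
in this message, but they are consistent with every cost in the table.  The
names total_cost and rate_pct are made up for illustration.

```python
# Assumed cost weights: not stated in the post, chosen because they
# reproduce every "real cost" and "best cost" figure in the table.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20

N_HAM = 20000
N_SPAM = 14000


def total_cost(fp, fn, unsure):
    """Dollar cost of a run, given its fp, fn and unsure counts."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


def rate_pct(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total


# (fp, fn, unsure) counts taken from the results quoted above.
runs = {
    "max12 real": (2, 0, 103),
    "max20 real": (2, 0, 100),
    "max12 best": (2, 3, 12 + 8),
    "max20 best": (2, 3, 11 + 11),
}

for name, (fp, fn, unsure) in runs.items():
    print(f"{name}: cost ${total_cost(fp, fn, unsure):.2f}, "
          f"fp {rate_pct(fp, N_HAM):.4f}%, "
          f"fn {rate_pct(fn, N_SPAM):.4f}%, "
          f"unsure {rate_pct(unsure, N_HAM + N_SPAM):.4f}%")
```

Under those assumed weights the four costs come out to $40.60, $40.00,
$27.00 and $27.40, matching the table.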

Going from max12 to max20, the classifier pickle grew by about 1.5 MB (~8.4%
bigger).

Anthony, you didn't respond to the question about whether you could have
gotten a similar improvement simply by changing cutoff values.  The data you
posted showed a large decrease in unsures at the expense of a large boost in
your FN rate.  It's quite plausible that exactly the same would have
happened if you raised ham_cutoff.  See my results above, where boosting
ham_cutoff from 0.20 to 0.50 would get rid of about 80% of my unsures at the
cost of letting 3 (vs 0) spam through.
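
To make the tradeoff concrete, here is a toy sketch of how the two cutoffs
split scores into ham, unsure and spam, and why raising ham_cutoff turns
unsures into false negatives.  The scores are invented for illustration, not
taken from any test run, and the exact boundary handling (< vs <=) is an
assumption rather than a statement of what the real code does.

```python
def classify(score, ham_cutoff, spam_cutoff):
    """Three-way verdict for a score in [0, 1].

    Boundary handling is an assumption made for this sketch.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def tally(ham_scores, spam_scores, ham_cutoff, spam_cutoff):
    """Count fp, fn and unsure for the given cutoffs."""
    counts = {"fp": 0, "fn": 0, "unsure": 0}
    for score in ham_scores:
        verdict = classify(score, ham_cutoff, spam_cutoff)
        if verdict == "spam":
            counts["fp"] += 1      # ham called spam
        elif verdict == "unsure":
            counts["unsure"] += 1
    for score in spam_scores:
        verdict = classify(score, ham_cutoff, spam_cutoff)
        if verdict == "ham":
            counts["fn"] += 1      # spam called ham
        elif verdict == "unsure":
            counts["unsure"] += 1
    return counts


# Invented scores, purely for illustration.
ham_scores = [0.01, 0.05, 0.35, 0.25, 0.02]
spam_scores = [0.99, 0.95, 0.45, 0.30, 0.85]

print(tally(ham_scores, spam_scores, ham_cutoff=0.20, spam_cutoff=0.80))
print(tally(ham_scores, spam_scores, ham_cutoff=0.50, spam_cutoff=0.80))
```

With these made-up scores, raising ham_cutoff from 0.20 to 0.50 removes all
four unsures, but the two low-scoring spam become false negatives: the same
kind of trade described above, on a much smaller scale.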