[Spambayes] options.skip_max_word_size.
Tim Peters
tim.one@comcast.net
Mon Oct 28 17:01:15 2002
On skip_max_word_size, my c.l.py test, 10-fold CV, ham_cutoff=0.20 and
spam_cutoff=0.80:
-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto]
filename:        max12       max20
ham:spam:  20000:14000 20000:14000
fp total:            2           2   the same
fp %:             0.01        0.01
fn total:            0           0   the same
fn %:             0.00        0.00
unsure t:          103         100   slight decrease
unsure %:         0.30        0.29
real cost:      $40.60      $40.00   slight improvement with these cutoffs
best cost:      $27.00      $27.40   best possible got slightly worse
h mean:           0.28        0.27
h sdev:           2.99        2.92
s mean:          99.94       99.93
s sdev:           1.41        1.47
mean diff:       99.66       99.66
k:               22.65       22.70
"Best possible" in max20 would have been to boost ham_cutoff to 0.50(!), and
drop spam_cutoff a little to 0.78. This would have traded away most of the
unsures in return for letting 3 spam through:
-> smallest ham & spam cutoffs 0.5 & 0.78
-> fp 2; fn 3; unsure ham 11; unsure spam 11
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0647%
Best possible in max12 was much the same:
-> largest ham & spam cutoffs 0.5 & 0.78
-> fp 2; fn 3; unsure ham 12; unsure spam 8
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%
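
The "real cost" and "best cost" figures can be reproduced from the counts above with a simple weighted sum. The sketch below is not the test driver's code. The weights ($10 per false positive, $1 per false negative, $0.20 per unsure) are an assumption, chosen because they reproduce all four dollar amounts reported in this message, and the function name is hypothetical.

```python
# Assumed per-message costs; picked because they match the reported totals.
FP_COST = 10.00      # a ham classified as spam
FN_COST = 1.00       # a spam classified as ham
UNSURE_COST = 0.20   # any message left in the unsure range


def total_cost(fp, fn, unsure):
    """Weighted sum of false positives, false negatives, and unsures."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


# (fp, fn, unsure) counts taken from the results in this message.
runs = {
    "max12 real": (2, 0, 103),
    "max20 real": (2, 0, 100),
    "max12 best": (2, 3, 12 + 8),    # unsure ham + unsure spam
    "max20 best": (2, 3, 11 + 11),
}

for name, (fp, fn, unsure) in runs.items():
    print(f"{name}: ${total_cost(fp, fn, unsure):.2f}")
```

Under these weights, moving to the "best" cutoffs removes roughly 80 unsures (about $16) and adds 3 false negatives ($3), which is where the drop from around $40 to around $27 comes from.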
With max20, the classifier pickle size increased by about 1.5 MB (~8.4% bigger).
Anthony, you didn't respond to the question about whether you could have
gotten a similar improvement simply by changing cutoff values. The data you
posted showed a large decrease in unsures at the expense of a large boost in
your FN rate. It's quite plausible that exactly the same would have
happened if you raised ham_cutoff. See my results above, where boosting
ham_cutoff from 0.20 to 0.50 would get rid of about 80% of my unsures at the
cost of letting 3 (vs 0) spam thru.
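
To illustrate why raising ham_cutoff trades unsures for false negatives, here is a minimal sketch of a three-way decision on a spam score. It is not the Spambayes implementation: the function name, the handling of scores exactly at a cutoff, and the example scores are all assumptions made for illustration.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.80):
    """Three-way decision on a spam score in [0.0, 1.0].

    Boundary handling (strict < for ham, >= for spam) is an assumption.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


# Made-up scores for three spam messages, for illustration only.
hypothetical_spam_scores = [0.35, 0.62, 0.97]

for ham_cutoff in (0.20, 0.50):
    labels = [classify(s, ham_cutoff=ham_cutoff)
              for s in hypothetical_spam_scores]
    print(ham_cutoff, labels)
```

With ham_cutoff=0.20 the spam scoring 0.35 lands in the unsure range. With ham_cutoff=0.50 it is called ham, so one unsure becomes one false negative. Changing a tokenizer option can produce the same kind of shift, which is why the two effects need to be compared at matched cutoffs.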