[Spambayes] options.skip_max_word_size.
Anthony Baxter
anthony@interlink.com.au
Mon Oct 28 07:08:01 2002
I noticed that a bunch of really nice ham clues were getting skipped in the
'unsure' bucket of my personal email. They were words like 'interconnection'
and other longer techie words. I added an option, skip_max_word_size, and
tried boosting it to 20 (from the default of 12).
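For anyone who hasn't looked at the tokenizer, here is a rough sketch of the idea behind the option. It is not the actual Spambayes tokenizer code: the function name tokenize_word and the exact form of the "skip:" summary token are simplified assumptions for illustration. The point is that a word longer than the cutoff is replaced by a coarse summary token, so the word itself never becomes a clue.

```python
def tokenize_word(word, skip_max_word_size=12):
    """Yield clue tokens for one word (simplified, hypothetical sketch)."""
    n = len(word)
    if 3 <= n <= skip_max_word_size:
        # Short enough: the word itself is used as a clue.
        yield word
    elif n > skip_max_word_size:
        # Too long: emit only a summary token (first character plus the
        # length rounded down to a multiple of 10). The word itself is lost.
        yield "skip:%s %d" % (word[0], n // 10 * 10)
    # Words shorter than 3 characters are dropped entirely (assumption).


# 'interconnection' is 15 characters long.
print(list(tokenize_word("interconnection")))
print(list(tokenize_word("interconnection", skip_max_word_size=20)))
```

With the cutoff at 12 the first call produces only a summary token; with the cutoff at 20 the second call keeps 'interconnection' as a clue in its own right.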
cmp.py shows this (skip_max_word_size 12 on left, 20 on right):
false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  4 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    0.000  0.277  lost  +(was 0)
    0.559  0.838  lost   +49.91%
    0.836  1.114  lost   +33.25%
    0.279  0.836  lost  +199.64%

won   0 times
tied  0 times
lost  4 times

total unique fn went from 6 to 11 lost +83.33%
mean fn % went from 0.418410855729 to 0.766214465324 lost +83.12%

ham mean                     ham sdev
   0.67    0.58  -13.43%        4.54    3.89  -14.32%
   0.45    0.38  -15.56%        2.64    2.24  -15.15%
   0.68    0.67   -1.47%        4.44    4.57   +2.93%
   0.48    0.45   -6.25%        3.52    3.47   -1.42%

ham mean and sdev for all runs
   0.57    0.52   -8.77%        3.86    3.64   -5.70%

spam mean                    spam sdev
  98.30   98.76   +0.47%        8.61    7.95   -7.67%
  97.47   97.61   +0.14%       10.67   10.78   +1.03%
  98.51   98.43   -0.08%        9.13   10.93  +19.72%
  97.58   97.27   -0.32%       10.90   12.08  +10.83%

spam mean and sdev for all runs
  97.97   98.02   +0.05%        9.88   10.56   +6.88%

ham/spam mean difference: 97.40 97.50 +0.10
Unfortunately, cmp.py skips the important bit: my 'unsure' count
went from 164 to 135!
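To put that next to the percentages in the cmp.py output, the change can be worked out the same way. This is just a quick sketch; pct_change is a hypothetical helper, not something taken from cmp.py.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) * 100.0 / before


# 'unsure' counts from the two runs (cutoff 12 vs. cutoff 20).
print("unsure: %+.2f%%" % pct_change(164, 135))   # unsure: -17.68%
# Total unique false negatives, as reported by cmp.py.
print("fn:     %+.2f%%" % pct_change(6, 11))      # fn:     +83.33%
```

That is 29 fewer unsures, against 5 more unique false negatives.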
I'm not sure whether this is just an artifact of my own data or
something more general. If others could try it as well, that
would be good.
Anthony