[Spambayes] new option: generate_long_skips
Skip Montanaro
skip@pobox.com
Mon, 30 Sep 2002 17:43:38 -0500
> I just checked in a new option for the tokenizer: generate_long_skips.
...
> I am currently running a test with 10 sets of 200 messages per set.
Almost exactly the same:
cutoffs -> noskipss
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
false positive percentages
    1.000  1.000  tied
    1.500  1.500  tied
    1.000  1.000  tied
    1.000  1.000  tied
    1.000  1.000  tied
    1.500  1.000  won    -33.33%
    3.500  3.500  tied
    1.500  1.500  tied
    1.500  1.500  tied
    1.500  1.500  tied
won   1 times
tied  9 times
lost  0 times
total unique fp went from 30 to 29 won     -3.33%
mean fp % went from 1.5 to 1.45 won     -3.33%
false negative percentages
    0.500  0.500  tied
    1.500  1.500  tied
    0.500  0.500  tied
    0.500  1.000  lost  +100.00%
    2.000  2.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    1.000  1.000  tied
    0.000  0.000  tied
    1.500  1.500  tied
won   0 times
tied  9 times
lost  1 times
total unique fn went from 17 to 18 lost    +5.88%
mean fn % went from 0.85 to 0.9 lost    +5.88%
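For anyone who wants to check the arithmetic in the two tables above, here is a minimal sketch that reproduces the won/tied/lost tallies and the percentage changes. The names `pct_change` and `tally` are hypothetical and used only for illustration (they are not taken from the comparison script), and the per-run rates are copied by hand from the output above.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        # Avoid dividing by zero; an unchanged zero counts as no change.
        return 0.0 if after == 0 else float("inf")
    return (after - before) * 100.0 / before


def tally(before_rates, after_rates):
    """Count runs where the error rate went down (won), stayed equal, or went up (lost)."""
    won = tied = lost = 0
    for b, a in zip(before_rates, after_rates):
        if a < b:
            won += 1
        elif a > b:
            lost += 1
        else:
            tied += 1
    return won, tied, lost


# Per-run rates (percent), copied from the tables above.
fp_before = [1.0, 1.5, 1.0, 1.0, 1.0, 1.5, 3.5, 1.5, 1.5, 1.5]
fp_after = [1.0, 1.5, 1.0, 1.0, 1.0, 1.0, 3.5, 1.5, 1.5, 1.5]
fn_before = [0.5, 1.5, 0.5, 0.5, 2.0, 0.0, 1.0, 1.0, 0.0, 1.5]
fn_after = [0.5, 1.5, 0.5, 1.0, 2.0, 0.0, 1.0, 1.0, 0.0, 1.5]

print("fp won/tied/lost:", tally(fp_before, fp_after))
print("fn won/tied/lost:", tally(fn_before, fn_after))
print("total unique fp change: %+.2f%%" % pct_change(30, 29))
print("total unique fn change: %+.2f%%" % pct_change(17, 18))
```

Note that a single message changing sides in a 200-message run moves that run's rate by 0.5 points, which is why the one fn loss shows up as +100.00% even though the totals only moved from 17 to 18.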
ham mean                     ham sdev
  20.82   20.59   -1.10%        6.43    6.18   -3.89%
  21.86   21.66   -0.91%        6.63    6.26   -5.58%
  21.38   21.26   -0.56%        6.49    6.30   -2.93%
  21.96   21.79   -0.77%        6.26    6.24   -0.32%
  21.51   21.29   -1.02%        6.72    6.62   -1.49%
  21.66   21.43   -1.06%        6.98    6.84   -2.01%
  21.45   21.32   -0.61%        7.66    7.65   -0.13%
  21.74   21.51   -1.06%        6.69    6.64   -0.75%
  21.71   21.49   -1.01%        7.44    7.21   -3.09%
  21.87   21.73   -0.64%        5.93    5.92   -0.17%
ham mean and sdev for all runs
  21.60   21.41   -0.88%        6.75    6.61   -2.07%
spam mean                    spam sdev
  74.10   73.56   -0.73%       12.99   13.13   +1.08%
  72.47   71.74   -1.01%       13.92   13.78   -1.01%
  74.05   73.52   -0.72%       13.00   13.07   +0.54%
  74.00   73.54   -0.62%       12.27   12.16   -0.90%
  72.43   71.91   -0.72%       13.73   13.52   -1.53%
  72.68   72.24   -0.61%       13.27   13.26   -0.08%
  72.57   71.84   -1.01%       13.03   12.99   -0.31%
  71.50   71.30   -0.28%       12.12   12.14   +0.17%
  73.25   72.68   -0.78%       12.67   12.53   -1.10%
  73.02   72.70   -0.44%       12.44   12.43   -0.08%
spam mean and sdev for all runs
  73.01   72.50   -0.70%       12.98   12.94   -0.31%
ham/spam mean difference: 51.41 51.09 -0.32
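The "ham/spam mean difference" line is just the gap between the all-runs spam mean and the all-runs ham mean, before and after the change. A small sketch of that calculation, using the all-runs figures from the output above (`mean_separation` is a hypothetical helper name, not something from the test scripts):

```python
def mean_separation(ham_mean, spam_mean):
    """Gap between the mean spam score and the mean ham score."""
    return spam_mean - ham_mean


# All-runs means copied from the output above.
before = mean_separation(ham_mean=21.60, spam_mean=73.01)
after = mean_separation(ham_mean=21.41, spam_mean=72.50)

# Rounded to two places to match the precision of the report.
print(round(before, 2), round(after, 2), round(after - before, 2))
```

Both means dropped slightly, and the spam mean dropped a little more, so the two populations ended up about a third of a point closer together.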
I notice it's suggesting an even lower cutoff now (0.375).
Before:
-> best cutoff for all runs: 0.4
-> with weighted total 1*30 fp + 17 fn = 47
-> fp rate 1.5% fn rate 0.85%
After:
-> best cutoff for all runs: 0.375
-> with weighted total 1*35 fp + 7 fn = 42
-> fp rate 1.75% fn rate 0.35%
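To make the cutoff comparison concrete, here is a rough sketch of how the weighted totals and rates in the two summaries fit together. It assumes the leading `1*` is the weight applied to false positives and that the rates are taken over all 2000 hams and 2000 spams (10 sets of 200 each); `weighted_total` and `rate_pct` are hypothetical names for illustration.

```python
def weighted_total(fp, fn, fp_weight=1):
    """Cost of a cutoff: false positives scaled by a weight, plus false negatives."""
    return fp_weight * fp + fn


def rate_pct(errors, n_tested):
    """Error count expressed as a percentage of the messages tested."""
    return 100.0 * errors / n_tested


# Assumption: 10 sets of 200 messages for each of ham and spam.
N_HAM = N_SPAM = 10 * 200

# (cutoff, fp count, fn count) from the "Before" and "After" summaries.
summaries = [(0.4, 30, 17), (0.375, 35, 7)]

for cutoff, fp, fn in summaries:
    print("cutoff %.3f: total %d, fp rate %.2f%%, fn rate %.2f%%" % (
        cutoff,
        weighted_total(fp, fn),
        rate_pct(fp, N_HAM),
        rate_pct(fn, N_SPAM),
    ))
```

With the weight at 1, the lower cutoff trades 5 extra false positives for 10 fewer false negatives, which is why it comes out ahead (42 versus 47). A larger `fp_weight` would change that balance.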
Skip