[Spambayes] Here's why "generate_long_skips: False" worked...
Neil Schemenauer
nas@python.ca
Mon, 30 Sep 2002 20:21:00 -0700
Tim Peters wrote:
> [Neil Schemenauer]
> > I tried generating 2 character-grams when has_highbit_char was true.
>
> In addition to, or in lieu of, generating skip tokens?
In addition.
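For anyone following along, here is roughly what the two variants look like. This is a minimal sketch, not the code I actually ran: tokenize_long_word, char_2grams, the "2g:" prefix and the keep_skip_tokens flag are names made up for illustration, and the has_highbit_char pattern assumes the text is a string whose high-bit characters fall in the range 0x80-0xff.

```python
import re

# Assumption: a "high-bit" character is anything in the range 0x80-0xff.
has_highbit_char = re.compile(r"[\x80-\xff]").search


def char_2grams(word):
    """Yield the overlapping character 2-grams of word, tagged with "2g:"."""
    for i in range(len(word) - 1):
        yield "2g:" + word[i:i + 2]


def tokenize_long_word(word, keep_skip_tokens=True):
    """Hypothetical handling of a word too long to be used as a token itself.

    keep_skip_tokens=True  -> 2-grams in addition to the skip token
    keep_skip_tokens=False -> 2-grams instead of the skip token
    """
    if keep_skip_tokens:
        # Skip token: first character plus the length rounded down
        # to a multiple of 10.
        yield "skip:%c %d" % (word[0], len(word) // 10 * 10)
    if has_highbit_char(word):
        for gram in char_2grams(word):
            yield gram
```

With keep_skip_tokens=False a pure-ASCII long word produces no tokens at all here, which is one way the "instead of" and "in addition to" runs can differ.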
> 1. Current vs doing character 2-grams when has_highbit_char is true
> instead of generating skip tokens.
Left is current:
false positive percentages
0.000 0.000 tied
1.000 1.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 0.500 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 5 to 5 tied
mean fp % went from 0.25 to 0.25 tied
false negative percentages
0.000 0.000 tied
1.000 1.000 tied
1.000 1.000 tied
0.500 0.500 tied
1.500 1.500 tied
1.500 1.500 tied
0.500 0.500 tied
0.500 0.500 tied
1.000 1.000 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fn went from 15 to 15 tied
mean fn % went from 0.75 to 0.75 tied
       ham mean                    ham sdev
   27.66   27.62   -0.14%      8.52    8.51   -0.12%
   26.51   26.47   -0.15%      8.75    8.79   +0.46%
   25.82   25.76   -0.23%      7.92    7.91   -0.13%
   27.03   27.00   -0.11%      8.22    8.28   +0.73%
   26.95   26.88   -0.26%      8.21    8.26   +0.61%
   29.23   29.19   -0.14%      9.28    9.27   -0.11%
   27.25   27.20   -0.18%      8.15    8.16   +0.12%
   26.89   26.83   -0.22%      7.88    7.89   +0.13%
   27.02   26.93   -0.33%      9.02    8.99   -0.33%
   26.63   26.57   -0.23%      7.20    7.18   -0.28%
ham mean and sdev for all runs
   27.10   27.05   -0.18%      8.38    8.39   +0.12%
       spam mean                   spam sdev
   81.73   82.38   +0.80%     10.24   10.96   +7.03%
   80.90   81.56   +0.82%     10.16   10.96   +7.87%
   80.03   81.11   +1.35%      9.99   11.02  +10.31%
   81.51   82.48   +1.19%     10.28   11.29   +9.82%
   81.44   82.31   +1.07%     10.43   11.13   +6.71%
   81.11   82.17   +1.31%      9.82   10.87  +10.69%
   80.64   81.69   +1.30%      9.52   10.47   +9.98%
   80.43   81.48   +1.31%      9.84   10.74   +9.15%
   81.18   82.02   +1.03%     10.25   10.91   +6.44%
   81.17   82.59   +1.75%      9.90   11.10  +12.12%
spam mean and sdev for all runs
   81.01   81.98   +1.20%     10.06   10.96   +8.95%
ham/spam mean difference: 53.91 54.93 +1.02
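In case the summary lines are unclear, the arithmetic behind them can be reproduced from the "all runs" means above. This is only a sketch of the calculation, with pct_change as a made-up helper name; the comparison script may work from unrounded values, so other rows need not match to the last digit.

```python
def pct_change(before, after):
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100.0


# "All runs" means copied from the tables above (left = current, right = new).
ham_before, ham_after = 27.10, 27.05
spam_before, spam_after = 81.01, 81.98

# The ham/spam mean difference is spam mean minus ham mean, per column.
diff_before = spam_before - ham_before
diff_after = spam_after - ham_after

print("ham mean change:  %+.2f%%" % pct_change(ham_before, ham_after))
print("spam mean change: %+.2f%%" % pct_change(spam_before, spam_after))
print("ham/spam mean difference: %.2f %.2f %+.2f"
      % (diff_before, diff_after, diff_after - diff_before))
```

A larger difference means the ham and spam score means moved further apart, even though the fp and fn counts did not change.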
>
> 2. Current vs doing character 2-grams when has_highbit_char is true
> in addition to generating skip tokens.
Again, left is current:
false positive percentages
0.000 0.000 tied
1.000 1.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 0.500 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 5 to 5 tied
mean fp % went from 0.25 to 0.25 tied
false negative percentages
0.000 0.000 tied
1.000 1.000 tied
1.000 1.000 tied
0.500 0.500 tied
1.500 1.500 tied
1.500 1.500 tied
0.500 0.500 tied
0.500 0.500 tied
1.000 1.000 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fn went from 15 to 15 tied
mean fn % went from 0.75 to 0.75 tied
       ham mean                    ham sdev
   27.66   27.66   +0.00%      8.52    8.52   +0.00%
   26.51   26.52   +0.04%      8.75    8.79   +0.46%
   25.82   25.82   +0.00%      7.92    7.92   +0.00%
   27.03   27.06   +0.11%      8.22    8.28   +0.73%
   26.95   26.96   +0.04%      8.21    8.25   +0.49%
   29.23   29.23   +0.00%      9.28    9.28   +0.00%
   27.25   27.26   +0.04%      8.15    8.16   +0.12%
   26.89   26.89   +0.00%      7.88    7.88   +0.00%
   27.02   27.02   +0.00%      9.02    9.02   +0.00%
   26.63   26.63   +0.00%      7.20    7.20   +0.00%
ham mean and sdev for all runs
   27.10   27.10   +0.00%      8.38    8.39   +0.12%
       spam mean                   spam sdev
   81.73   82.51   +0.95%     10.24   11.00   +7.42%
   80.90   81.66   +0.94%     10.16   10.98   +8.07%
   80.03   81.24   +1.51%      9.99   11.18  +11.91%
   81.51   82.58   +1.31%     10.28   11.35  +10.41%
   81.44   82.38   +1.15%     10.43   11.17   +7.09%
   81.11   82.29   +1.45%      9.82   10.91  +11.10%
   80.64   81.78   +1.41%      9.52   10.48  +10.08%
   80.43   81.57   +1.42%      9.84   10.80   +9.76%
   81.18   82.13   +1.17%     10.25   10.96   +6.93%
   81.17   82.71   +1.90%      9.90   11.22  +13.33%
spam mean and sdev for all runs
   81.01   82.09   +1.33%     10.06   11.02   +9.54%
ham/spam mean difference: 53.91 54.99 +1.08