[Spambayes] There Can Be Only One
Neil Schemenauer
nas@python.ca
Tue, 24 Sep 2002 20:55:48 -0700
Here's my first result (Graham on left):
false positive percentages
0.000 0.500 lost +(was 0)
1.500 1.500 tied
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 1.000 lost +100.00%
0.500 0.500 tied
0.500 0.000 won -100.00%
0.000 0.500 lost +(was 0)
0.000 0.500 lost +(was 0)
won 1 times
tied 5 times
lost 4 times
total unique fp went from 7 to 10 lost +42.86%
mean fp % went from 0.35 to 0.5 lost +42.86%
false negative percentages
0.500 0.000 won -100.00%
2.000 2.000 tied
1.000 0.500 won -50.00%
1.000 0.500 won -50.00%
2.000 1.500 won -25.00%
1.000 1.000 tied
0.500 0.500 tied
2.000 0.000 won -100.00%
1.500 0.500 won -66.67%
0.000 0.000 tied
won 6 times
tied 4 times
lost 0 times
total unique fn went from 23 to 13 won -43.48%
mean fn % went from 1.15 to 0.65 won -43.48%
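The won/lost percentages in these tables are relative changes from the left column to the right. A minimal sketch of that arithmetic, assuming the comparison tool labels a drop as "won" and a rise as "lost" (describe_change is a hypothetical name, not a function from the test tools):

```python
def describe_change(before, after):
    """Label a before -> after change in an error count or rate."""
    if before == after:
        return "tied"
    label = "won" if after < before else "lost"
    if before == 0:
        # A relative change from zero is undefined, so just flag it.
        return f"{label} +(was 0)"
    pct = (after - before) / before * 100.0
    return f"{label} {pct:+.2f}%"


# Totals taken from the Graham vs. Robinson comparison above.
print("fp:", describe_change(7, 10))
print("fn:", describe_change(23, 13))
```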
Already Robinson seems to be doing better. The only tweak I made was to
set spam_cutoff to 0.580. Next, I tried setting robinson_probability_a
to 0.2 (compared to last Robinson result):
total unique fp went from 10 to 6 won -40.00%
mean fp % went from 0.5 to 0.3 won -40.00%
total unique fn went from 13 to 14 lost +7.69%
mean fn % went from 0.65 to 0.7 lost +7.69%
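Since fp and fn move in opposite directions here, it can help to add them up. A rough sketch, assuming 2000 ham and 2000 spam were scored in total (inferred from the figures above, where 10 fp is reported as 0.5% and 13 fn as 0.65%); overall_error_rate is a hypothetical helper:

```python
def overall_error_rate(fp, fn, n_ham, n_spam):
    """Percentage of all scored messages that were misclassified."""
    return (fp + fn) / (n_ham + n_spam) * 100.0


# Assumed corpus sizes, inferred from the reported percentages.
N_HAM = N_SPAM = 2000

before = overall_error_rate(10, 13, N_HAM, N_SPAM)  # previous Robinson run
after = overall_error_rate(6, 14, N_HAM, N_SPAM)    # robinson_probability_a = 0.2

print(f"overall error: {before:.3f}% -> {after:.3f}%")
```

This treats a false positive and a false negative as equally costly, which is a simplification.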
That's another win in overall error rate. Setting
robinson_probability_a to 0.05 made things worse (again compared to the
best result):
total unique fp went from 6 to 5 won -16.67%
mean fp % went from 0.3 to 0.25 won -16.67%
total unique fn went from 14 to 25 lost +78.57%
mean fn % went from 0.7 to 1.25 lost +78.57%
I could probably do better by adjusting the cutoff but not better than
a = 0.2. Now, a = 0.1:
total unique fp went from 6 to 5 won -16.67%
mean fp % went from 0.3 to 0.25 won -16.67%
total unique fn went from 14 to 17 lost +21.43%
mean fn % went from 0.7 to 0.85 lost +21.43%
Again, I could probably do a little better by lowering the cutoff but I
think a = 0.2 is still better. Just for kicks, a = 0.8:
total unique fp went from 6 to 11 lost +83.33%
mean fp % went from 0.3 to 0.55 lost +83.33%
total unique fn went from 14 to 18 lost +28.57%
mean fn % went from 0.7 to 0.9 lost +28.57%
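To see why 'a' matters, here is a sketch of the adjustment from Gary Robinson's write-up, f(w) = (a*x + n*p(w)) / (a + n). It assumes robinson_probability_a and robinson_probability_x are the prior strength and prior value in that formula; the function name and the example word are made up for illustration:

```python
def robinson_adjust(p, n, a=0.2, x=0.5):
    """Blend the observed spam probability p of a word seen in n messages
    with the prior guess x, where a is the weight given to the prior."""
    return (a * x + n * p) / (a + n)


# A word seen once, and only in spam (p = 1.0), for the values of a tried above.
for a in (0.05, 0.1, 0.2, 0.8):
    print(f"a = {a}: adjusted probability = {robinson_adjust(1.0, 1, a=a):.4f}")
```

A larger 'a' pulls rarely seen words harder toward x, so a single sighting counts for less.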
So 'a' between 0.1 and 0.2 seems best. Sticking with a = 0.2 and
lowering robinson_minimum_prob_strength to 0.0 (lowering cutoff to 0.56
to hit the sweet spot of the distribution):
total unique fp went from 6 to 12 lost +100.00%
mean fp % went from 0.3 to 0.6 lost +100.00%
total unique fn went from 14 to 16 lost +14.29%
mean fn % went from 0.7 to 0.8 lost +14.29%
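For reference, a sketch of what this option is assumed to do: drop words whose probability lies within the given distance of 0.5, since they say little either way. The helper name and the word probabilities are invented for the example:

```python
def strong_clues(word_probs, min_strength=0.1):
    """Keep only words whose probability is at least min_strength away from 0.5."""
    return {w: p for w, p in word_probs.items() if abs(p - 0.5) >= min_strength}


# Made-up probabilities, purely for illustration.
probs = {"viagra": 0.99, "the": 0.52, "meeting": 0.08, "free": 0.61}

for strength in (0.0, 0.1, 0.2):
    print(strength, sorted(strong_clues(probs, strength)))
```

With a strength of 0.0 every word is used, including near-neutral ones like "the" in this toy example.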
It seems robinson_minimum_prob_strength > 0 is useful. Now, increase it
to 0.2:
false positive percentages
0.000 0.500 lost +(was 0)
1.000 2.000 lost +100.00%
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 1.000 lost +100.00%
0.500 0.500 tied
0.000 0.500 lost +(was 0)
0.500 1.500 lost +200.00%
0.000 0.500 lost +(was 0)
won 0 times
tied 4 times
lost 6 times
total unique fp went from 6 to 14 lost +133.33%
mean fp % went from 0.3 to 0.7 lost +133.33%
false negative percentages
0.000 0.000 tied
1.000 1.000 tied
1.000 0.000 won -100.00%
0.500 0.500 tied
2.000 1.500 won -25.00%
1.500 1.000 won -33.33%
0.000 0.000 tied
0.500 0.000 won -100.00%
0.500 0.000 won -100.00%
0.000 0.000 tied
won 5 times
tied 5 times
lost 0 times
total unique fn went from 14 to 8 won -42.86%
mean fn % went from 0.7 to 0.4 won -42.86%
I could probably do slightly better by adjusting the cutoff but I think
0.1 is still better. I tried increasing max_discriminators to 1500
(with robinson_minimum_prob_strength back to 0.1 and cutoff back to
0.58). This is another close one:
false positive percentages
0.000 0.000 tied
1.000 1.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 0.500 tied
0.500 0.500 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 6 to 6 tied
mean fp % went from 0.3 to 0.3 tied
false negative percentages
0.000 0.000 tied
1.000 1.000 tied
1.000 1.000 tied
0.500 0.500 tied
2.000 2.000 tied
1.500 1.500 tied
0.000 0.500 lost +(was 0)
0.500 0.500 tied
0.500 0.500 tied
0.000 0.000 tied
won 0 times
tied 9 times
lost 1 times
total unique fn went from 14 to 15 lost +7.14%
mean fn % went from 0.7 to 0.75 lost +7.14%
I'd say it's a toss-up. Now, max_discriminators = 70 (another close
one):
false positive percentages
0.000 0.000 tied
1.000 0.500 won -50.00%
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 0.500 tied
0.500 0.500 tied
0.000 0.000 tied
0.500 0.500 tied
0.000 0.500 lost +(was 0)
won 1 times
tied 8 times
lost 1 times
total unique fp went from 6 to 6 tied
mean fp % went from 0.3 to 0.3 tied
false negative percentages
0.000 0.000 tied
1.000 1.000 tied
1.000 1.000 tied
0.500 0.500 tied
2.000 2.000 tied
1.500 1.000 won -33.33%
0.000 0.000 tied
0.500 0.500 tied
0.500 0.500 tied
0.000 0.000 tied
won 1 times
tied 9 times
lost 0 times
total unique fn went from 14 to 13 won -7.14%
mean fn % went from 0.7 to 0.65 won -7.14%
Hmmm, a little better perhaps? Hard to say. Let's try 50 and compare
it to 150:
false positive percentages
0.000 0.000 tied
1.000 1.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 0.500 tied
0.500 0.500 tied
0.000 0.500 lost +(was 0)
0.500 0.500 tied
0.000 0.000 tied
won 0 times
tied 9 times
lost 1 times
total unique fp went from 6 to 7 lost +16.67%
mean fp % went from 0.3 to 0.35 lost +16.67%
false negative percentages
0.000 0.000 tied
1.000 1.000 tied
1.000 0.500 won -50.00%
0.500 0.500 tied
2.000 2.000 tied
1.500 1.500 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 1.000 lost +100.00%
0.000 0.000 tied
won 1 times
tied 8 times
lost 1 times
total unique fn went from 14 to 14 tied
mean fn % went from 0.7 to 0.7 tied
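A sketch of the selection that max_discriminators is assumed to control: rank a message's words by distance from 0.5 and keep at most that many of the most extreme. pick_discriminators and the sample probabilities are hypothetical:

```python
def pick_discriminators(word_probs, max_discriminators=150):
    """Return up to max_discriminators (word, prob) pairs, most extreme first."""
    ranked = sorted(
        word_probs.items(),
        key=lambda item: abs(item[1] - 0.5),
        reverse=True,
    )
    return ranked[:max_discriminators]


# Made-up probabilities, purely for illustration.
probs = {"viagra": 0.99, "the": 0.52, "meeting": 0.08, "free": 0.61}
print(pick_discriminators(probs, max_discriminators=2))
```

Most messages have far fewer than 1500 distinct strong clues, which may be one reason the runs with different limits look so similar.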
It seems to be no better. Finally, I changed the random seed and
retested, comparing an empty bayescustomize.ini against the settings
that gave me the best result with "Robinson":
false positive percentages
0.000 0.000 tied
1.500 1.000 won -33.33%
0.000 0.000 tied
0.000 0.000 tied
0.500 0.500 tied
0.500 0.500 tied
0.500 0.500 tied
0.500 0.500 tied
0.000 0.500 lost +(was 0)
0.000 0.000 tied
won 1 times
tied 8 times
lost 1 times
total unique fp went from 7 to 7 tied
mean fp % went from 0.35 to 0.35 tied
false negative percentages
0.500 0.000 won -100.00%
2.000 1.000 won -50.00%
1.000 0.500 won -50.00%
1.000 0.500 won -50.00%
2.000 2.000 tied
1.000 1.500 lost +50.00%
0.500 0.000 won -100.00%
2.000 0.500 won -75.00%
1.500 1.000 won -33.33%
0.000 0.000 tied
won 7 times
tied 2 times
lost 1 times
total unique fn went from 23 to 14 won -39.13%
mean fn % went from 1.15 to 0.7 won -39.13%
Well, that's all for me for now. In summary, the best
result was with:
"""
[Classifier]
use_robinson_combining: True
use_robinson_probability: True
robinson_probability_x: 0.5
robinson_probability_a: 0.2
max_discriminators: 150
robinson_minimum_prob_strength: 0.1
[TestDriver]
SPAM_cutoff: 0.580
"""
HTH,
Neil
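The summary settings above can be sanity-checked by parsing them. A minimal sketch using Python's standard configparser; this is not the project's own option loader, just an illustration that the "name: value" layout is valid INI syntax:

```python
import configparser

# The best-result settings, copied from the summary above.
BEST_SETTINGS = """\
[Classifier]
use_robinson_combining: True
use_robinson_probability: True
robinson_probability_x: 0.5
robinson_probability_a: 0.2
max_discriminators: 150
robinson_minimum_prob_strength: 0.1
[TestDriver]
SPAM_cutoff: 0.580
"""

parser = configparser.ConfigParser()
parser.read_string(BEST_SETTINGS)

# configparser lowercases option names by default, so the lookup
# works regardless of how SPAM_cutoff is capitalized in the file.
cutoff = parser.getfloat("TestDriver", "spam_cutoff")
prior_strength = parser.getfloat("Classifier", "robinson_probability_a")
max_disc = parser.getint("Classifier", "max_discriminators")
use_robinson = parser.getboolean("Classifier", "use_robinson_combining")

print(cutoff, prior_strength, max_disc, use_robinson)
```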