[Spambayes] Combining combining schemes
Tim Peters
tim.one@comcast.net
Fri Oct 18 19:43:33 2002
I mentioned earlier that chi-combining and gary-combining have quite
different ideas about "how certain" they are on my extreme FP and FN. So I
checked in some new options to allow us to play with that:
"""
[Classifier]
# Use a weighted average of chi-combining and gary-combining.
use_mixed_combining: False
mixed_combining_chi_weight: 0.9
"""
I ran my fat test just once (10-fold CV with 20,000 ham and 14,000 spam),
making parameters up off the top of my head:
"""
[Classifier]
use_mixed_combining: True
mixed_combining_chi_weight: 0.9
[TestDriver]
ham_cutoff: 0.10
spam_cutoff: 0.90
nbuckets: 200
"""
The bottom line is that this particular combination of settings removed
all(!) false negatives, left me with my 2 very hard FP, moved all other hard
ham very solidly into the middle ground, and had an unsure rate under 1%:
-> <stat> all runs false positives: 2
-> <stat> all runs false negatives: 0
-> <stat> all runs unsure: 226
-> <stat> all runs false positive %: 0.01
-> <stat> all runs false negative %: 0.0
-> <stat> all runs unsure %: 0.664705882353
-> <stat> all runs cost: $65.20
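For anyone checking the arithmetic, here is a minimal sketch of how these tallies and the cost figure fit together. The per-item charges ($10.00 per FP, $1.00 per FN, $0.20 per unsure) are the ones the test driver reports in this message; the function names are made up for illustration, and the exact boundary handling (strictly below ham_cutoff is ham, at or above spam_cutoff is spam) is an assumption rather than a quote from the test driver.

```python
def classify(score, ham_cutoff=0.10, spam_cutoff=0.90):
    """Three-way decision on a score in [0, 1].

    Assumption: scores below ham_cutoff are ham, scores at or above
    spam_cutoff are spam, and everything in between is unsure.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def tally(ham_scores, spam_scores, ham_cutoff=0.10, spam_cutoff=0.90):
    """Count false positives, false negatives and unsures."""
    fp = sum(classify(s, ham_cutoff, spam_cutoff) == "spam" for s in ham_scores)
    fn = sum(classify(s, ham_cutoff, spam_cutoff) == "ham" for s in spam_scores)
    unsure = sum(
        classify(s, ham_cutoff, spam_cutoff) == "unsure"
        for s in list(ham_scores) + list(spam_scores)
    )
    return fp, fn, unsure


def total_cost(n_fp, n_fn, n_unsure,
               fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Total cost in dollars for the given error counts."""
    return n_fp * fp_cost + n_fn * fn_cost + n_unsure * unsure_cost
```

With 2 FP, 0 FN and 226 unsure, total_cost(2, 0, 226) comes to 20 + 0 + 45.20 = $65.20, matching the figure above.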
The histogram analysis found that it was possible to reduce the total middle
ground to 20 (out of 34,000!) messages at the cost of accepting 3 FN:
-> best cost for all runs: $27.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 3 cutoff pairs
-> smallest ham & spam cutoffs 0.5 & 0.75
-> fp 2; fn 3; unsure ham 12; unsure spam 8
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%
-> largest ham & spam cutoffs 0.5 & 0.76
-> fp 2; fn 3; unsure ham 12; unsure spam 8
-> fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%
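The search behind a "best cost" report like this one can be sketched as a brute-force scan over bucket boundaries. This is only an illustration under stated assumptions, not the test driver's actual code: best_cutoffs is a hypothetical name, scores are taken to be in [0, 1], and candidate cutoffs are the nbuckets + 1 bucket edges with ham_cutoff <= spam_cutoff.

```python
from itertools import combinations_with_replacement


def best_cutoffs(ham_scores, spam_scores, nbuckets=200,
                 fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Return (best_cost, pairs), where pairs lists every
    (ham_cutoff, spam_cutoff) pair achieving best_cost, in sorted order.
    """
    ham_scores = list(ham_scores)
    spam_scores = list(spam_scores)
    edges = [i / nbuckets for i in range(nbuckets + 1)]
    best_cost, best_pairs = None, []
    # combinations_with_replacement on sorted edges yields ham_cut <= spam_cut.
    for ham_cut, spam_cut in combinations_with_replacement(edges, 2):
        fp = sum(s >= spam_cut for s in ham_scores)
        fn = sum(s < ham_cut for s in spam_scores)
        unsure = sum(ham_cut <= s < spam_cut
                     for s in ham_scores + spam_scores)
        # Round to cents so that ties compare equal despite float noise.
        cost = round(fp * fp_cost + fn * fn_cost + unsure * unsure_cost, 2)
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, [(ham_cut, spam_cut)]
        elif cost == best_cost:
            best_pairs.append((ham_cut, spam_cut))
    return best_cost, best_pairs
```

The first and last entries of the returned list play the role of the "smallest" and "largest" cutoff pairs in the report. A scan like this is quadratic in nbuckets, so it is only meant for playing with small samples.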
I can't make more time for this right now, but I think there's clearly
potential worth pursuing.
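For reference, the mixing itself is tiny. Below is a minimal sketch of the weighted average that the use_mixed_combining comment describes; mixed_score is a made-up name, not the function in the classifier. The detailed score dumps further down are consistent with it: 0.9 * 1 + 0.1 * 0.654334 gives the Nigerian quote's 0.9654, and 0.9 * 0.296176 + 0.1 * 0.480559 gives the "BlackIntrepid" spam's 0.3146.

```python
def mixed_score(chi_score, gary_score, chi_weight=0.9):
    """Weighted average of the chi-combining and gary-combining scores.

    Both scores are assumed to be in [0, 1]; chi_weight plays the role
    of the mixed_combining_chi_weight option.
    """
    if not 0.0 <= chi_weight <= 1.0:
        raise ValueError("chi_weight must be in [0, 1]")
    return chi_weight * chi_score + (1.0 - chi_weight) * gary_score
```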
-> <stat> Ham scores for all runs: 20000 items; mean 2.81; sdev 2.92
-> <stat> min 0.121417; median 2.54101; max 96.5433
-> <stat> percentiles: 5% 1.68334; 25% 2.20207; 75% 2.89507; 95% 3.54761
* = 111 items
0.0 6 *
0.5 41 *
1.0 420 ****
1.5 2355 **********************
2.0 6526 ***********************************************************
2.5 6743 *************************************************************
3.0 2789 **************************
3.5 568 ******
4.0 120 **
4.5 71 *
5.0 44 *
5.5 21 *
6.0 23 *
6.5 14 *
7.0 17 *
7.5 8 *
8.0 9 *
8.5 12 *
9.0 6 *
9.5 4 *
10.0 7 *
10.5 10 *
11.0 7 *
11.5 10 *
12.0 7 *
12.5 9 *
13.0 5 *
13.5 3 *
14.0 3 *
14.5 4 *
15.0 3 *
15.5 7 *
16.0 3 *
16.5 7 *
17.0 1 *
17.5 5 *
18.0 4 *
18.5 0
19.0 0
19.5 4 *
20.0 3 *
20.5 3 *
21.0 2 *
21.5 3 *
22.0 1 *
22.5 2 *
23.0 1 *
23.5 3 *
24.0 2 *
24.5 3 *
25.0 0
25.5 1 *
26.0 5 *
26.5 0
27.0 2 *
27.5 3 *
28.0 3 *
28.5 1 *
29.0 3 *
29.5 1 *
30.0 1 *
30.5 0
31.0 1 *
31.5 2 *
32.0 0
32.5 2 *
33.0 3 *
33.5 1 *
34.0 2 *
34.5 0
35.0 0
35.5 1 *
36.0 2 *
36.5 2 *
37.0 1 *
37.5 2 *
38.0 0
38.5 2 *
39.0 1 *
39.5 1 *
40.0 2 *
40.5 1 *
41.0 1 *
41.5 0
42.0 2 *
42.5 2 *
43.0 1 *
43.5 0
44.0 2 *
44.5 0
45.0 0
45.5 2 *
46.0 0
46.5 1 *
47.0 2 *
47.5 0
48.0 0
48.5 2 *
49.0 3 *
49.5 3 *
50.0 1 * A resume from "an experienced
engineer/mathematician/modeler who has built models and done
computational mathematics in Python".
50.5 0
51.0 3 * TOOLS Europe '99 conference announcement
A word-free post listing 3 URLs; we've argued before
about whether it's ham or spam; I think it's ham
Someone posting a reply they got from MSN Hotmail Customer
support in response to a complaint about fetish porn
spam on c.l.py
51.5 0
52.0 0
52.5 0
53.0 0
53.5 0
54.0 1 * "If you are interested in saving money ..."
54.5 0
55.0 0
55.5 0
56.0 0
56.5 0
57.0 0
57.5 0
58.0 0
58.5 0
59.0 0
59.5 1 * questions about the job and real estate markets in France
60.0 1 * HTML "Please unsubscribe me"
60.5 0
61.0 0
61.5 0
62.0 1 * asking for advice on how to break into others' computers
62.5 0
63.0 0
63.5 0
64.0 0
64.5 0
65.0 0
65.5 0
66.0 0
66.5 0
67.0 0
67.5 0
68.0 0
68.5 0
69.0 1 * long emotional msg the day after the 911 terrorist attack
69.5 0
70.0 0
70.5 0
71.0 0
71.5 1 * Job announcement from Industrial Light & Magic. Hurt
in part because split-on-whitespace left "Python-savvy"
as one word.
72.0 0
72.5 0
73.0 1 * asking for help with a webmaster-ish program; it's in the
middle ground of both schemes:
prob('*gary_score*') = 0.532758
prob('*chi_score*') = 0.751966
73.5 0
74.0 0
74.5 1 * inappropriate two-word "confirm 438765" followed by
"Get Your Private, Free E-mail from ..."
75.0 0
75.5 0
76.0 0
76.5 0
77.0 0
77.5 0
78.0 0
78.5 0
79.0 0
79.5 0
80.0 0
80.5 0
81.0 0
81.5 0
82.0 0
82.5 0
83.0 0
83.5 0
84.0 0
84.5 0
85.0 0
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 0
89.0 0
89.5 0
90.0 0
90.5 0
91.0 0
91.5 0
92.0 0
92.5 0
93.0 0
93.5 0
94.0 0
94.5 1 * lady with the long, obnoxious employer-generated sig;
gary-combining looks on this one much more kindly (but
still outside a reasonable middle ground for it); chi is
only slightly unsure
prob('*gary_score*') = 0.597568
prob('*chi_score*') = 0.986116
prob('*H*') = 0.0277634
prob('*S*') = 0.999996
prob('*Q*') = 0.542133
prob('*P*') = 0.805009
95.0 0
95.5 0
96.0 0
96.5 1 * Nigerian scam quote
gary-combining again has a much milder judgment, but
chi is off the charts
prob = 0.965433332477
prob('*gary_score*') = 0.654334
prob('*chi_score*') = 1
prob('*H*') = 7.07788e-008
prob('*S*') = 1
prob('*Q*') = 0.466239
prob('*P*') = 0.882573
97.0 0
97.5 0
98.0 0
98.5 0
99.0 0
99.5 0
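In case the layout of these histograms is unclear: with nbuckets: 200, each bucket covers half a percentage point, and each star stands for a fixed number of items. The sketch below reproduces that layout under some guesses. The 61-column maximum width is inferred from the "* = 111 items" legend (6743 / 61 rounds up to 111), the round-up rule for stars is inferred from the rows above, and the function names are hypothetical.

```python
import math


def bucket_index(score_pct, nbuckets=200):
    """Map a score in percent (0..100) to a bucket; 100 goes in the last one."""
    i = int(score_pct * nbuckets / 100.0)
    return min(i, nbuckets - 1)


def render_histogram(scores_pct, nbuckets=200, max_width=61):
    """Return the histogram as a list of text lines, legend first."""
    counts = [0] * nbuckets
    for s in scores_pct:
        counts[bucket_index(s, nbuckets)] += 1
    # Assumption: one star covers enough items that the tallest bar
    # fits in max_width columns.
    per_star = max(1, math.ceil(max(counts) / max_width))
    width = 100.0 / nbuckets
    lines = [f"* = {per_star} items"]
    for i, n in enumerate(counts):
        # Any non-empty bucket gets at least one star (round up).
        stars = "*" * math.ceil(n / per_star)
        lines.append(f"{i * width:.1f} {n} {stars}".rstrip())
    return lines
```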
-> <stat> Spam scores for all runs: 14000 items; mean 98.32; sdev 1.55
-> <stat> min 31.4614; median 98.3667; max 99.9601
-> <stat> percentiles: 5% 97.1931; 25% 97.9872; 75% 98.7541; 95% 99.657
Note that > 95% of spam scored higher than the Nigerian "ham"! (Its score,
96.54, is lower than spam's 5th-percentile score of 97.19.)
* = 76 items
... [all 0] ...
30.5 0
31.0 1 * "Hello, my Name is BlackIntrepid"
prob = 0.314614377139
prob('*gary_score*') = 0.480559
prob('*chi_score*') = 0.296176
prob('*H*') = 0.930885
prob('*S*') = 0.523237
prob('*Q*') = 0.684254
prob('*P*') = 0.633036
31.5 0
32.0 0
32.5 0
33.0 0
33.5 1 * uuencoded text body we throw away unlooked at
34.0 0
34.5 0
35.0 0
35.5 0
36.0 0
36.5 0
37.0 0
37.5 0
38.0 0
38.5 0
39.0 0
39.5 0
40.0 0
40.5 0
41.0 0
41.5 0
42.0 0
42.5 0
43.0 0
43.5 0
44.0 0
44.5 0
45.0 0
45.5 0
46.0 0
46.5 0
47.0 0
47.5 0
48.0 0
48.5 0
49.0 1 * giant base64-encoded text file; gary- and chi- both score it
near 0.50
49.5 0
50.0 1 * Website Programmers Available Now!; full of tech talk
50.5 2 * webmaster link directory
the spam with dozens of killer spam clues hiding in
meta tags we don't look at
51.0 0
51.5 0
52.0 0
52.5 0
53.0 0
53.5 0
54.0 0
54.5 0
55.0 0
55.5 0
56.0 0
56.5 0
57.0 0
57.5 0
58.0 1 *
58.5 0
59.0 0
59.5 0
60.0 0
60.5 0
61.0 0
61.5 0
62.0 0
62.5 0
63.0 1 *
63.5 0
64.0 0
64.5 0
65.0 0
65.5 0
66.0 0
66.5 1 *
67.0 0
67.5 0
68.0 0
68.5 1 *
69.0 0
69.5 0
70.0 0
70.5 0
71.0 0
71.5 0
72.0 0
72.5 0
73.0 0
73.5 1 *
74.0 0
74.5 0
75.0 0
75.5 0
76.0 1 *
76.5 0
77.0 0
77.5 0
78.0 1 *
78.5 0
79.0 0
79.5 0
80.0 0
80.5 0
81.0 1 *
81.5 0
82.0 1 *
82.5 1 *
83.0 1 *
83.5 0
84.0 0
84.5 1 *
85.0 2 *
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 1 *
89.0 3 *
89.5 1 *
90.0 2 *
90.5 1 *
91.0 1 *
91.5 0
92.0 16 *
92.5 3 *
93.0 3 *
93.5 2 *
94.0 2 *
94.5 6 *
95.0 6 *
95.5 20 *
96.0 76 *
96.5 269 ****
97.0 838 ************
97.5 2329 *******************************
98.0 4600 *************************************************************
98.5 3792 **************************************************
99.0 1045 **************
99.5 964 *************