[Spambayes] Combining combining schemes

Tim Peters tim.one@comcast.net
Fri Oct 18 19:43:33 2002


I mentioned earlier that chi-combining and gary-combining have quite
different ideas about "how certain" they are on my extreme FP and FN.  So I
checked in some new options to allow us to play with that:

"""
[Classifier]
# Use a weighted average of chi-combining and gary-combining.
use_mixed_combining: False
mixed_combining_chi_weight: 0.9
"""
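
For the record, here's a minimal sketch of what "weighted average" means here. It assumes the mixed score is just chi_weight times the chi score plus the remainder times the gary score; `mixed_score` is a made-up helper name, not the checked-in code. The inputs are the per-message `*chi_score*` and `*gary_score*` values quoted in the histograms further down.

```python
def mixed_score(chi_score, gary_score, chi_weight=0.9):
    """Weighted average of the two schemes' scores (hypothetical helper).

    Assumption: the gary score gets whatever weight chi doesn't,
    i.e. 1 - chi_weight.
    """
    return chi_weight * chi_score + (1.0 - chi_weight) * gary_score


# The Nigerian scam quote (a ham), using the scores quoted below.
print(mixed_score(chi_score=1.0, gary_score=0.654334))

# The "BlackIntrepid" spam, likewise.
print(mixed_score(chi_score=0.296176, gary_score=0.480559))
```

Both results land within rounding of the `prob` values reported for those two messages (about 0.9654 and 0.3146), so the assumption at least fits the numbers in this post.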

I ran my fat test just once (10-fold CV with 20,000 ham and 14,000 spam),
making parameters up off the top of my head:

"""
[Classifier]
use_mixed_combining: True
mixed_combining_chi_weight: 0.9

[TestDriver]
ham_cutoff:  0.10
spam_cutoff: 0.90
nbuckets: 200
"""
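
A rough sketch of how the two cutoffs carve scores into three bands. The function name is hypothetical, and exactly which side of each cutoff a boundary score falls on is an assumption here, not something taken from the test driver.

```python
def classify(score, ham_cutoff=0.10, spam_cutoff=0.90):
    """Three-way verdict for a score in [0.0, 1.0] (hypothetical helper).

    Assumption: below ham_cutoff is ham, at or above spam_cutoff is
    spam, and everything in between is the "unsure" middle ground.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


print(classify(0.0254))    # a typical ham score
print(classify(0.7300))    # a hard ham that lands in the middle ground
print(classify(0.9654))    # a ham scoring this high counts as an FP
```

A ham that comes back "spam" is a false positive, a spam that comes back "ham" is a false negative, and anything "unsure" counts against the unsure rate.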

The bottom line is that this particular combination of settings removed
all(!) false negatives, left me with my 2 very hard FP, moved all other hard
ham very solidly into the middle ground, and had an unsure rate under 1%:

-> <stat> all runs false positives: 2
-> <stat> all runs false negatives: 0
-> <stat> all runs unsure: 226
-> <stat> all runs false positive %: 0.01
-> <stat> all runs false negative %: 0.0
-> <stat> all runs unsure %: 0.664705882353
-> <stat> all runs cost: $65.20

The histogram analysis found that it was possible to reduce the total middle
ground to 20 (out of 34,000!) messages at the cost of biting 3 FN:

-> best cost for all runs: $27.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 3 cutoff pairs
-> smallest ham & spam cutoffs 0.5 & 0.75
->     fp 2; fn 3; unsure ham 12; unsure spam 8
->     fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%
-> largest ham & spam cutoffs 0.5 & 0.76
->     fp 2; fn 3; unsure ham 12; unsure spam 8
->     fp rate 0.01%; fn rate 0.0214%; unsure rate 0.0588%
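
The cost figures are easy to check by hand from the per-fp, per-fn and per-unsure prices listed above. A small sketch (helper names are made up for illustration; money is kept in integer cents to dodge float fuzz):

```python
# Prices from the "per-fp cost ... per-unsure cost" line, in cents.
FP_CENTS = 1000
FN_CENTS = 100
UNSURE_CENTS = 20


def total_cost_cents(fp, fn, unsure):
    """Total cost of a run in cents (hypothetical helper)."""
    return fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS


def as_dollars(cents):
    """Format integer cents the way the test output prints costs."""
    return "$%d.%02d" % divmod(cents, 100)


# The run as configured: 2 FP, 0 FN, 226 unsure.
print(as_dollars(total_cost_cents(fp=2, fn=0, unsure=226)))

# The best cutoff pairs the histogram analysis found:
# 2 FP, 3 FN, 12 unsure ham + 8 unsure spam.
print(as_dollars(total_cost_cents(fp=2, fn=3, unsure=12 + 8)))
```

These reproduce the $65.20 and $27.00 reported above.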

I can't make more time for this right now, but I think there's clearly
potential worth pursuing.

-> <stat> Ham scores for all runs: 20000 items; mean 2.81; sdev 2.92
-> <stat> min 0.121417; median 2.54101; max 96.5433
-> <stat> percentiles: 5% 1.68334; 25% 2.20207; 75% 2.89507; 95% 3.54761
* = 111 items
 0.0    6 *
 0.5   41 *
 1.0  420 ****
 1.5 2355 **********************
 2.0 6526 ***********************************************************
 2.5 6743 *************************************************************
 3.0 2789 **************************
 3.5  568 ******
 4.0  120 **
 4.5   71 *
 5.0   44 *
 5.5   21 *
 6.0   23 *
 6.5   14 *
 7.0   17 *
 7.5    8 *
 8.0    9 *
 8.5   12 *
 9.0    6 *
 9.5    4 *
10.0    7 *
10.5   10 *
11.0    7 *
11.5   10 *
12.0    7 *
12.5    9 *
13.0    5 *
13.5    3 *
14.0    3 *
14.5    4 *
15.0    3 *
15.5    7 *
16.0    3 *
16.5    7 *
17.0    1 *
17.5    5 *
18.0    4 *
18.5    0
19.0    0
19.5    4 *
20.0    3 *
20.5    3 *
21.0    2 *
21.5    3 *
22.0    1 *
22.5    2 *
23.0    1 *
23.5    3 *
24.0    2 *
24.5    3 *
25.0    0
25.5    1 *
26.0    5 *
26.5    0
27.0    2 *
27.5    3 *
28.0    3 *
28.5    1 *
29.0    3 *
29.5    1 *
30.0    1 *
30.5    0
31.0    1 *
31.5    2 *
32.0    0
32.5    2 *
33.0    3 *
33.5    1 *
34.0    2 *
34.5    0
35.0    0
35.5    1 *
36.0    2 *
36.5    2 *
37.0    1 *
37.5    2 *
38.0    0
38.5    2 *
39.0    1 *
39.5    1 *
40.0    2 *
40.5    1 *
41.0    1 *
41.5    0
42.0    2 *
42.5    2 *
43.0    1 *
43.5    0
44.0    2 *
44.5    0
45.0    0
45.5    2 *
46.0    0
46.5    1 *
47.0    2 *
47.5    0
48.0    0
48.5    2 *
49.0    3 *
49.5    3 *
50.0    1 *  A resume from "an experienced
             engineer/mathematician/modeler who has built models and done
             computational mathematics in Python".
50.5    0
51.0    3 *  TOOLS Europe '99 conference announcement
             A word-free post kidy listing 3 URLs; we've argued before
                 about whether it's ham or spam; I think it's ham
             Someone posting a reply they got from MSN Hotmail Customer
                 support in response to a complaint about fetish porn
                 spam on c.l.py
51.5    0
52.0    0
52.5    0
53.0    0
53.5    0
54.0    1 *  "If you are interested in saving money ..."
54.5    0
55.0    0
55.5    0
56.0    0
56.5    0
57.0    0
57.5    0
58.0    0
58.5    0
59.0    0
59.5    1 *  questions about the job and real estate markets in France
60.0    1 *  HTML "Please unsubscribe me"
60.5    0
61.0    0
61.5    0
62.0    1 *  asking for advice on how to break into others' computers
62.5    0
63.0    0
63.5    0
64.0    0
64.5    0
65.0    0
65.5    0
66.0    0
66.5    0
67.0    0
67.5    0
68.0    0
68.5    0
69.0    1 *  long emotional msg the day after the 911 terrorist attack
69.5    0
70.0    0
70.5    0
71.0    0
71.5    1 *  Job announcement from Industrial Light & Magic.  Hurt
             in part because split-on-whitespace left "Python-savvy"
             as one word.
72.0    0
72.5    0
73.0    1 *  asking for help with a webmaster-ish program; it's in the
                 middle ground of both schemes:
             prob('*gary_score*') = 0.532758
             prob('*chi_score*') = 0.751966
73.5    0
74.0    0
74.5    1 *  inappropriate two-word "confirm 438765" followed by
                 "Get Your Private, Free E-mail from ..."
75.0    0
75.5    0
76.0    0
76.5    0
77.0    0
77.5    0
78.0    0
78.5    0
79.0    0
79.5    0
80.0    0
80.5    0
81.0    0
81.5    0
82.0    0
82.5    0
83.0    0
83.5    0
84.0    0
84.5    0
85.0    0
85.5    0
86.0    0
86.5    0
87.0    0
87.5    0
88.0    0
88.5    0
89.0    0
89.5    0
90.0    0
90.5    0
91.0    0
91.5    0
92.0    0
92.5    0
93.0    0
93.5    0
94.0    0
94.5    1 *  lady with the long, obnoxious employer-generated sig;
             gary-combining looks on this one much more kindly (but
             still outside a reasonable middle ground for it); chi is
             only slightly unsure
             prob('*gary_score*') = 0.597568
             prob('*chi_score*') = 0.986116
             prob('*H*') = 0.0277634
             prob('*S*') = 0.999996
             prob('*Q*') = 0.542133
             prob('*P*') = 0.805009
95.0    0
95.5    0
96.0    0
96.5    1 *  Nigerian scam quote
             gary-combining again has a much milder judgment, but
             chi is off the charts
             prob = 0.965433332477
             prob('*gary_score*') = 0.654334
             prob('*chi_score*') = 1
             prob('*H*') = 7.07788e-008
             prob('*S*') = 1
             prob('*Q*') = 0.466239
             prob('*P*') = 0.882573
97.0    0
97.5    0
98.0    0
98.5    0
99.0    0
99.5    0

-> <stat> Spam scores for all runs: 14000 items; mean 98.32; sdev 1.55
-> <stat> min 31.4614; median 98.3667; max 99.9601
-> <stat> percentiles: 5% 97.1931; 25% 97.9872; 75% 98.7541; 95% 99.657

Note that > 95% of spam scored higher than the Nigerian "ham"!  (Its score,
96.54, is lower than spam's 5th-percentile score, 97.19.)

* = 76 items
... [all 0] ...
30.5    0
31.0    1 *  "Hello, my Name is BlackIntrepid"
             prob = 0.314614377139
             prob('*gary_score*') = 0.480559
             prob('*chi_score*') = 0.296176
             prob('*H*') = 0.930885
             prob('*S*') = 0.523237
             prob('*Q*') = 0.684254
             prob('*P*') = 0.633036
31.5    0
32.0    0
32.5    0
33.0    0
33.5    1 *  uuencoded text body we throw away unlooked at
34.0    0
34.5    0
35.0    0
35.5    0
36.0    0
36.5    0
37.0    0
37.5    0
38.0    0
38.5    0
39.0    0
39.5    0
40.0    0
40.5    0
41.0    0
41.5    0
42.0    0
42.5    0
43.0    0
43.5    0
44.0    0
44.5    0
45.0    0
45.5    0
46.0    0
46.5    0
47.0    0
47.5    0
48.0    0
48.5    0
49.0    1 *  giant base64-encoded text file; gary- and chi- both score it
             near 0.50
49.5    0
50.0    1 *  Website Programmers Available Now!; full of tech talk
50.5    2 *  webmaster link directory
             the spam with dozens of killer spam clues hiding in
             meta tags we don't look at
51.0    0
51.5    0
52.0    0
52.5    0
53.0    0
53.5    0
54.0    0
54.5    0
55.0    0
55.5    0
56.0    0
56.5    0
57.0    0
57.5    0
58.0    1 *
58.5    0
59.0    0
59.5    0
60.0    0
60.5    0
61.0    0
61.5    0
62.0    0
62.5    0
63.0    1 *
63.5    0
64.0    0
64.5    0
65.0    0
65.5    0
66.0    0
66.5    1 *
67.0    0
67.5    0
68.0    0
68.5    1 *
69.0    0
69.5    0
70.0    0
70.5    0
71.0    0
71.5    0
72.0    0
72.5    0
73.0    0
73.5    1 *
74.0    0
74.5    0
75.0    0
75.5    0
76.0    1 *
76.5    0
77.0    0
77.5    0
78.0    1 *
78.5    0
79.0    0
79.5    0
80.0    0
80.5    0
81.0    1 *
81.5    0
82.0    1 *
82.5    1 *
83.0    1 *
83.5    0
84.0    0
84.5    1 *
85.0    2 *
85.5    0
86.0    0
86.5    0
87.0    0
87.5    0
88.0    0
88.5    1 *
89.0    3 *
89.5    1 *
90.0    2 *
90.5    1 *
91.0    1 *
91.5    0
92.0   16 *
92.5    3 *
93.0    3 *
93.5    2 *
94.0    2 *
94.5    6 *
95.0    6 *
95.5   20 *
96.0   76 *
96.5  269 ****
97.0  838 ************
97.5 2329 *******************************
98.0 4600 *************************************************************
98.5 3792 **************************************************
99.0 1045 **************
99.5  964 *************
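
For anyone reading the histograms: with nbuckets: 200, each row covers half a point on the 0-100 scale. A minimal sketch of that mapping, assuming a score is dropped into its bucket by truncation and a perfect 1.0 is folded into the top bucket (`bucket_label` is a hypothetical name, not the histogram code itself):

```python
def bucket_label(score, nbuckets=200):
    """Left edge, on a 0-100 scale, of the histogram row holding score.

    Assumptions: score is in [0.0, 1.0], buckets are equal width, and
    a score of exactly 1.0 goes into the last bucket.
    """
    index = min(int(score * nbuckets), nbuckets - 1)
    return index * 100.0 / nbuckets


print(bucket_label(0.965433332477))   # the Nigerian scam quote
print(bucket_label(0.314614377139))   # the "BlackIntrepid" spam
```

Under those assumptions the two messages fall in the 96.5 and 31.0 rows, which is where they appear above.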