[Spambayes]
Testing against someone else's corpora (Was: There Can Be Only One)
Neale Pickett
neale@woozle.org
Mon Oct 21 20:21:20 2002
I bet you thought I'd forgotten about this :)
So then, Tim Peters <tim.one@comcast.net> is all like:
> [Tim]
> >> 3. Is it possible to "seed" a database with somebody else's data and
> >> get decent results out of the box?
>
> [Neale Pickett]
> > $FIRM has a tangible interest in the answer to this question.
[snip]
> So I'd build a custom test driver on top of TestDriver, like so:
>
> d = TestDriver.Driver()
> d.train(ham, spam)  # create the seed database
> for user in users:
>     d.test(user.ham, user.spam)
>     d.finishtest()
> d.alldone()
[snip]
> The output will display results for each user individually, and an aggregate
> across all users. Then you'll want to stare at the output to see how well
> it does. Come back when you get that far <wink>.
Okay. Here's my test setup.
I have been collecting all the spam sent to $FIRM for the past week and
a half. I'm sad to report that "all the spam" means "all incoming mail
that spamassassin scored over 10". For the ten days I collected it, I
got 14997 spam! If this is typical, I understand better why spam
filtering is such a big deal.
The ham came from a guy who's been working here since 1998. It's every
message he's sent or received since then. He claims he hand-filtered
spam out of it, but I know from timcv runs that it's not that clean. I'm
working on hand-cleaning this and the spam corpus, but it's going to
take some time.
To test things, I hand-cleaned two mailboxes of co-workers, W and B.
Then I ran this code:
import TestDriver
from Options import options
import msgs

users = ("B", "W")
hamdir_template = "Data/Users/%s/Ham"
spamdir_template = "Data/Users/%s/Spam"

def drive(nsets):
    print(options.display())

    # Train the seed database on the first nsets sets of the shared corpus.
    spamdirs = [options.spam_directories % i for i in range(1, nsets+1)]
    hamdirs = [options.ham_directories % i for i in range(1, nsets+1)]

    d = TestDriver.Driver()
    d.train(msgs.HamStream("%s-%d" % (hamdirs[0], nsets), hamdirs),
            msgs.SpamStream("%s-%d" % (spamdirs[0], nsets), spamdirs))

    # Score each user's hand-cleaned mail against that one seed database.
    for user in users:
        hamdir = hamdir_template % user
        spamdir = spamdir_template % user
        d.test(msgs.HamStream(hamdir, [hamdir]),
               msgs.SpamStream(spamdir, [spamdir]))
        d.finishtest()
    d.alldone()

drive(2)
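W ends up with no spam at all in this run (see the "0 spams" line further down), which is why W's spam stats come out blank. A quick count of each user's corpus before running the driver would have flagged that. The following is only a sketch in modern Python: it assumes one message per file under the directory templates above, and `count_messages` and `corpus_report` are made-up helper names, not part of TestDriver or msgs.

```python
import os

users = ("B", "W")
hamdir_template = "Data/Users/%s/Ham"
spamdir_template = "Data/Users/%s/Spam"

def count_messages(directory):
    """Return the number of regular files in directory (0 if it is missing).

    Assumption: each message is stored as its own file.
    """
    if not os.path.isdir(directory):
        return 0
    return sum(1 for name in os.listdir(directory)
               if os.path.isfile(os.path.join(directory, name)))

def corpus_report(users, root="."):
    """Map each user to a (ham_count, spam_count) tuple."""
    report = {}
    for user in users:
        ham = count_messages(os.path.join(root, hamdir_template % user))
        spam = count_messages(os.path.join(root, spamdir_template % user))
        report[user] = (ham, spam)
    return report

if __name__ == "__main__":
    for user, (ham, spam) in sorted(corpus_report(users).items()):
        note = "" if ham and spam else "  <- empty corpus, its stats will be blank"
        print("%s: %d ham, %d spam%s" % (user, ham, spam, note))
```

Run from the directory that holds Data/, it prints one line per user and marks any user whose ham or spam count is zero.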
So, here's the output:
[TestDriver]
show_histograms = True
show_best_discriminators = 30
nbuckets = 200
spam_cutoff = 0.560
pickle_basename = class
show_ham_lo = 1.0
show_false_negatives = True
best_cutoff_fn_weight = 1.00
ham_cutoff = 0.560
show_spam_hi = 0.0
show_unsure = False
show_spam_lo = 1.0
save_trained_pickles = False
show_ham_hi = 0.0
show_false_positives = True
spam_directories = Data/Spam/Set%d
percentiles = 5 25 75 95
compute_best_cutoffs_from_histograms = True
best_cutoff_fp_weight = 10.00
show_charlimit = 3000
best_cutoff_unsure_weight = 0.20
ham_directories = Data/Ham/Set%d
save_histogram_pickles = False
[CV Driver]
build_each_classifier_from_scratch = False
[Tokenizer]
mine_received_headers = False
octet_prefix_size = 5
generate_long_skips = True
count_all_header_lines = False
check_octets = False
ignore_redundant_html = False
basic_header_tokenize = True
safe_headers = abuse-reports-to
    date
    errors-to
    from
    importance
    in-reply-to
    message-id
    mime-version
    organization
    received
    reply-to
    return-path
    subject
    to
    user-agent
    x-abuse-info
    x-complaints-to
    x-face
basic_header_skip = received
    x-.*
    delivered-to
    date
basic_header_tokenize_only = False
retain_pure_html_tags = False
[Classifier]
use_mixed_combining = False
robinson_probability_x = 0.5
robinson_minimum_prob_strength = 0.1
robinson_probability_s = 0.45
use_chi_squared_combining = False
max_discriminators = 150
mixed_combining_chi_weight = 0.9
-> Training on Data/Ham/Set1-2 & Data/Spam/Set1-2 ... 400 hams & 400 spams
-> Predicting Data/Users/B/Ham & Data/Users/B/Spam ...
-> <stat> tested 121 hams & 23 spams against 400 hams & 400 spams
-> <stat> false positive %: 7.43801652893
-> <stat> false negative %: 0.0
-> <stat> unsure %: 0.0
-> <stat> cost: $90.00
-> <stat> 9 new false positives
[snip]
-> <stat> 0 new false negatives
-> <stat> 0 new unsure
best discriminators:
'edit' 42 0.0564005
'to:skip:w 10' 43 0.370886
'header:Received:4' 44 0.00169875
'subject:PERFORCE' 44 0.00585176
'subject:change' 44 0.00585176
'subject:review' 44 0.00570342
'to:skip:B 10' 44 0.0412844
'...' 46 0.181134
'message-id:@horus.inside.$FIRM' 46 0.00556242
'your' 46 0.758353
'affected' 47 0.00570342
'message-id:skip:h 20' 48 0.0416277
'precedence:bulk' 48 0.0429152
'header:MIME-Version:1' 49 0.346045
'url:com' 49 0.761515
'you' 49 0.650341
'content-type:plain' 50 0.177419
'from' 50 0.691328
'change' 52 0.267149
'this' 53 0.655698
'proto:http' 54 0.738164
'files' 57 0.125668
'header:Message-Id:1' 60 0.72845
'message-id:skip:2 20' 60 0.724415
'header:Message-ID:1' 64 0.298295
'from:email addr:$FIRM>' 72 0.00825756
'from:skip:w 10' 76 0.0214323
'return-path:skip:w 10' 98 0.038085
'header:Return-Path:1' 121 0.685963
'content-type:text/plain' 124 0.272913
-> <stat> Ham scores for this pair: 121 items; mean 35.83; sdev 13.31
-> <stat> min 17.1664; median 32.0866; max 71.2362
-> <stat> percentiles: 5% 20.1379; 25% 24.186; 75% 45.0782; 95% 60.3254
* = 1 items
0.0 0
0.5 0
1.0 0
1.5 0
2.0 0
2.5 0
3.0 0
3.5 0
4.0 0
4.5 0
5.0 0
5.5 0
6.0 0
6.5 0
7.0 0
7.5 0
8.0 0
8.5 0
9.0 0
9.5 0
10.0 0
10.5 0
11.0 0
11.5 0
12.0 0
12.5 0
13.0 0
13.5 0
14.0 0
14.5 0
15.0 0
15.5 0
16.0 0
16.5 0
17.0 1 *
17.5 0
18.0 0
18.5 2 **
19.0 0
19.5 2 **
20.0 4 ****
20.5 2 **
21.0 1 *
21.5 5 *****
22.0 5 *****
22.5 3 ***
23.0 2 **
23.5 2 **
24.0 3 ***
24.5 1 *
25.0 1 *
25.5 2 **
26.0 6 ******
26.5 4 ****
27.0 0
27.5 3 ***
28.0 2 **
28.5 1 *
29.0 1 *
29.5 3 ***
30.0 0
30.5 1 *
31.0 2 **
31.5 1 *
32.0 2 **
32.5 0
33.0 0
33.5 1 *
34.0 0
34.5 0
35.0 2 **
35.5 1 *
36.0 1 *
36.5 0
37.0 2 **
37.5 0
38.0 3 ***
38.5 0
39.0 0
39.5 0
40.0 2 **
40.5 1 *
41.0 1 *
41.5 0
42.0 4 ****
42.5 1 *
43.0 3 ***
43.5 3 ***
44.0 2 **
44.5 1 *
45.0 2 **
45.5 1 *
46.0 3 ***
46.5 0
47.0 0
47.5 2 **
48.0 2 **
48.5 1 *
49.0 2 **
49.5 0
50.0 2 **
50.5 3 ***
51.0 0
51.5 0
52.0 1 *
52.5 0
53.0 0
53.5 0
54.0 1 *
54.5 0
55.0 2 **
55.5 0
56.0 0
56.5 0
57.0 0
57.5 1 *
58.0 0
58.5 1 *
59.0 0
59.5 0
60.0 1 *
60.5 1 *
61.0 0
61.5 0
62.0 0
62.5 0
63.0 0
63.5 0
64.0 0
64.5 0
65.0 0
65.5 0
66.0 0
66.5 0
67.0 0
67.5 0
68.0 2 **
68.5 1 *
69.0 0
69.5 0
70.0 1 *
70.5 0
71.0 1 *
71.5 0
72.0 0
72.5 0
73.0 0
73.5 0
74.0 0
74.5 0
75.0 0
75.5 0
76.0 0
76.5 0
77.0 0
77.5 0
78.0 0
78.5 0
79.0 0
79.5 0
80.0 0
80.5 0
81.0 0
81.5 0
82.0 0
82.5 0
83.0 0
83.5 0
84.0 0
84.5 0
85.0 0
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 0
89.0 0
89.5 0
90.0 0
90.5 0
91.0 0
91.5 0
92.0 0
92.5 0
93.0 0
93.5 0
94.0 0
94.5 0
95.0 0
95.5 0
96.0 0
96.5 0
97.0 0
97.5 0
98.0 0
98.5 0
99.0 0
99.5 0
-> <stat> Spam scores for this pair: 23 items; mean 73.88; sdev 5.94
-> <stat> min 62.9927; median 74.0114; max 82.6517
-> <stat> percentiles: 5% 64.6143; 25% 70.2017; 75% 78.8789; 95% 82.2079
* = 1 items
0.0 0
0.5 0
1.0 0
1.5 0
2.0 0
2.5 0
3.0 0
3.5 0
4.0 0
4.5 0
5.0 0
5.5 0
6.0 0
6.5 0
7.0 0
7.5 0
8.0 0
8.5 0
9.0 0
9.5 0
10.0 0
10.5 0
11.0 0
11.5 0
12.0 0
12.5 0
13.0 0
13.5 0
14.0 0
14.5 0
15.0 0
15.5 0
16.0 0
16.5 0
17.0 0
17.5 0
18.0 0
18.5 0
19.0 0
19.5 0
20.0 0
20.5 0
21.0 0
21.5 0
22.0 0
22.5 0
23.0 0
23.5 0
24.0 0
24.5 0
25.0 0
25.5 0
26.0 0
26.5 0
27.0 0
27.5 0
28.0 0
28.5 0
29.0 0
29.5 0
30.0 0
30.5 0
31.0 0
31.5 0
32.0 0
32.5 0
33.0 0
33.5 0
34.0 0
34.5 0
35.0 0
35.5 0
36.0 0
36.5 0
37.0 0
37.5 0
38.0 0
38.5 0
39.0 0
39.5 0
40.0 0
40.5 0
41.0 0
41.5 0
42.0 0
42.5 0
43.0 0
43.5 0
44.0 0
44.5 0
45.0 0
45.5 0
46.0 0
46.5 0
47.0 0
47.5 0
48.0 0
48.5 0
49.0 0
49.5 0
50.0 0
50.5 0
51.0 0
51.5 0
52.0 0
52.5 0
53.0 0
53.5 0
54.0 0
54.5 0
55.0 0
55.5 0
56.0 0
56.5 0
57.0 0
57.5 0
58.0 0
58.5 0
59.0 0
59.5 0
60.0 0
60.5 0
61.0 0
61.5 0
62.0 0
62.5 1 *
63.0 0
63.5 0
64.0 0
64.5 2 **
65.0 1 *
65.5 0
66.0 0
66.5 1 *
67.0 0
67.5 0
68.0 0
68.5 0
69.0 0
69.5 0
70.0 2 **
70.5 0
71.0 0
71.5 1 *
72.0 0
72.5 0
73.0 2 **
73.5 1 *
74.0 2 **
74.5 0
75.0 1 *
75.5 1 *
76.0 0
76.5 0
77.0 0
77.5 0
78.0 1 *
78.5 2 **
79.0 0
79.5 0
80.0 1 *
80.5 1 *
81.0 1 *
81.5 0
82.0 1 *
82.5 1 *
83.0 0
83.5 0
84.0 0
84.5 0
85.0 0
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 0
89.0 0
89.5 0
90.0 0
90.5 0
91.0 0
91.5 0
92.0 0
92.5 0
93.0 0
93.5 0
94.0 0
94.5 0
95.0 0
95.5 0
96.0 0
96.5 0
97.0 0
97.5 0
98.0 0
98.5 0
99.0 0
99.5 0
-> best cost for this pair: $2.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 4 cutoff pairs
-> smallest ham & spam cutoffs 0.61 & 0.715
-> fp 0; fn 0; unsure ham 5; unsure spam 7
-> fp rate 0%; fn rate 0%; unsure rate 8.33%
-> largest ham & spam cutoffs 0.625 & 0.715
-> fp 0; fn 0; unsure ham 5; unsure spam 7
-> fp rate 0%; fn rate 0%; unsure rate 8.33%
-> <stat> Ham scores for all in this training set: 121 items; mean 35.83; sdev 13.31
-> <stat> min 17.1664; median 32.0866; max 71.2362
-> <stat> percentiles: 5% 20.1379; 25% 24.186; 75% 45.0782; 95% 60.3254
* = 1 items
0.0 0
0.5 0
1.0 0
1.5 0
2.0 0
2.5 0
3.0 0
3.5 0
4.0 0
4.5 0
5.0 0
5.5 0
6.0 0
6.5 0
7.0 0
7.5 0
8.0 0
8.5 0
9.0 0
9.5 0
10.0 0
10.5 0
11.0 0
11.5 0
12.0 0
12.5 0
13.0 0
13.5 0
14.0 0
14.5 0
15.0 0
15.5 0
16.0 0
16.5 0
17.0 1 *
17.5 0
18.0 0
18.5 2 **
19.0 0
19.5 2 **
20.0 4 ****
20.5 2 **
21.0 1 *
21.5 5 *****
22.0 5 *****
22.5 3 ***
23.0 2 **
23.5 2 **
24.0 3 ***
24.5 1 *
25.0 1 *
25.5 2 **
26.0 6 ******
26.5 4 ****
27.0 0
27.5 3 ***
28.0 2 **
28.5 1 *
29.0 1 *
29.5 3 ***
30.0 0
30.5 1 *
31.0 2 **
31.5 1 *
32.0 2 **
32.5 0
33.0 0
33.5 1 *
34.0 0
34.5 0
35.0 2 **
35.5 1 *
36.0 1 *
36.5 0
37.0 2 **
37.5 0
38.0 3 ***
38.5 0
39.0 0
39.5 0
40.0 2 **
40.5 1 *
41.0 1 *
41.5 0
42.0 4 ****
42.5 1 *
43.0 3 ***
43.5 3 ***
44.0 2 **
44.5 1 *
45.0 2 **
45.5 1 *
46.0 3 ***
46.5 0
47.0 0
47.5 2 **
48.0 2 **
48.5 1 *
49.0 2 **
49.5 0
50.0 2 **
50.5 3 ***
51.0 0
51.5 0
52.0 1 *
52.5 0
53.0 0
53.5 0
54.0 1 *
54.5 0
55.0 2 **
55.5 0
56.0 0
56.5 0
57.0 0
57.5 1 *
58.0 0
58.5 1 *
59.0 0
59.5 0
60.0 1 *
60.5 1 *
61.0 0
61.5 0
62.0 0
62.5 0
63.0 0
63.5 0
64.0 0
64.5 0
65.0 0
65.5 0
66.0 0
66.5 0
67.0 0
67.5 0
68.0 2 **
68.5 1 *
69.0 0
69.5 0
70.0 1 *
70.5 0
71.0 1 *
71.5 0
72.0 0
72.5 0
73.0 0
73.5 0
74.0 0
74.5 0
75.0 0
75.5 0
76.0 0
76.5 0
77.0 0
77.5 0
78.0 0
78.5 0
79.0 0
79.5 0
80.0 0
80.5 0
81.0 0
81.5 0
82.0 0
82.5 0
83.0 0
83.5 0
84.0 0
84.5 0
85.0 0
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 0
89.0 0
89.5 0
90.0 0
90.5 0
91.0 0
91.5 0
92.0 0
92.5 0
93.0 0
93.5 0
94.0 0
94.5 0
95.0 0
95.5 0
96.0 0
96.5 0
97.0 0
97.5 0
98.0 0
98.5 0
99.0 0
99.5 0
-> <stat> Spam scores for all in this training set: 23 items; mean 73.88; sdev 5.94
-> <stat> min 62.9927; median 74.0114; max 82.6517
-> <stat> percentiles: 5% 64.6143; 25% 70.2017; 75% 78.8789; 95% 82.2079
* = 1 items
0.0 0
0.5 0
1.0 0
1.5 0
2.0 0
2.5 0
3.0 0
3.5 0
4.0 0
4.5 0
5.0 0
5.5 0
6.0 0
6.5 0
7.0 0
7.5 0
8.0 0
8.5 0
9.0 0
9.5 0
10.0 0
10.5 0
11.0 0
11.5 0
12.0 0
12.5 0
13.0 0
13.5 0
14.0 0
14.5 0
15.0 0
15.5 0
16.0 0
16.5 0
17.0 0
17.5 0
18.0 0
18.5 0
19.0 0
19.5 0
20.0 0
20.5 0
21.0 0
21.5 0
22.0 0
22.5 0
23.0 0
23.5 0
24.0 0
24.5 0
25.0 0
25.5 0
26.0 0
26.5 0
27.0 0
27.5 0
28.0 0
28.5 0
29.0 0
29.5 0
30.0 0
30.5 0
31.0 0
31.5 0
32.0 0
32.5 0
33.0 0
33.5 0
34.0 0
34.5 0
35.0 0
35.5 0
36.0 0
36.5 0
37.0 0
37.5 0
38.0 0
38.5 0
39.0 0
39.5 0
40.0 0
40.5 0
41.0 0
41.5 0
42.0 0
42.5 0
43.0 0
43.5 0
44.0 0
44.5 0
45.0 0
45.5 0
46.0 0
46.5 0
47.0 0
47.5 0
48.0 0
48.5 0
49.0 0
49.5 0
50.0 0
50.5 0
51.0 0
51.5 0
52.0 0
52.5 0
53.0 0
53.5 0
54.0 0
54.5 0
55.0 0
55.5 0
56.0 0
56.5 0
57.0 0
57.5 0
58.0 0
58.5 0
59.0 0
59.5 0
60.0 0
60.5 0
61.0 0
61.5 0
62.0 0
62.5 1 *
63.0 0
63.5 0
64.0 0
64.5 2 **
65.0 1 *
65.5 0
66.0 0
66.5 1 *
67.0 0
67.5 0
68.0 0
68.5 0
69.0 0
69.5 0
70.0 2 **
70.5 0
71.0 0
71.5 1 *
72.0 0
72.5 0
73.0 2 **
73.5 1 *
74.0 2 **
74.5 0
75.0 1 *
75.5 1 *
76.0 0
76.5 0
77.0 0
77.5 0
78.0 1 *
78.5 2 **
79.0 0
79.5 0
80.0 1 *
80.5 1 *
81.0 1 *
81.5 0
82.0 1 *
82.5 1 *
83.0 0
83.5 0
84.0 0
84.5 0
85.0 0
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 0
89.0 0
89.5 0
90.0 0
90.5 0
91.0 0
91.5 0
92.0 0
92.5 0
93.0 0
93.5 0
94.0 0
94.5 0
95.0 0
95.5 0
96.0 0
96.5 0
97.0 0
97.5 0
98.0 0
98.5 0
99.0 0
99.5 0
-> best cost for all in this training set: $2.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 4 cutoff pairs
-> smallest ham & spam cutoffs 0.61 & 0.715
-> fp 0; fn 0; unsure ham 5; unsure spam 7
-> fp rate 0%; fn rate 0%; unsure rate 8.33%
-> largest ham & spam cutoffs 0.625 & 0.715
-> fp 0; fn 0; unsure ham 5; unsure spam 7
-> fp rate 0%; fn rate 0%; unsure rate 8.33%
This doesn't look good--it's all over the map. However, IANAS, nor am I a
Tim, so I'll leave judgement up to you fine folks.
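For anyone who wants to check the headline numbers for B by hand, here is a small sketch of the arithmetic. The three weights are copied from the "per-fp cost" line in the output; the function names `percent` and `total_cost` are made up for illustration and are not TestDriver's own.

```python
# Weights copied from the "per-fp cost ..." line in the output above.
FP_COST = 10.00
FN_COST = 1.00
UNSURE_COST = 0.20

def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 when there is nothing to count."""
    return 100.0 * part / whole if whole else 0.0

def total_cost(fp, fn, unsure):
    """Dollar cost of a run under the weights above."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST

# B at ham_cutoff = spam_cutoff = 0.560: 9 f-ps out of 121 hams, nothing else wrong.
print("false positive %%: %.4f" % percent(9, 121))
print("cost: $%.2f" % total_cost(9, 0, 0))

# B at the suggested cutoffs 0.61 & 0.715: no f-ps or f-ns, 5 + 7 unsure
# out of 121 hams + 23 spams.
print("best cost: $%.2f" % total_cost(0, 0, 5 + 7))
print("unsure rate: %.2f%%" % percent(5 + 7, 121 + 23))
```

These should line up with the 7.438% false positive rate, the $90.00 cost, the $2.40 best cost, and the 8.33% unsure rate reported for B.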
Here's W's mail:
-> Predicting Data/Users/W/Ham & Data/Users/W/Spam ...
-> <stat> tested 361 hams & 0 spams against 400 hams & 400 spams
-> <stat> false positive %: 1.38504155125
-> <stat> false negative %: 0.0
-> <stat> unsure %: 0.0
-> <stat> cost: $50.00
-> <stat> 5 new false positives
[snip]
-> <stat> 0 new false negatives
-> <stat> 0 new unsure
best discriminators:
'2002' 129 0.805927
'our' 129 0.799565
'message-----' 134 0.0121951
'subject:' 140 0.0297372
'from:' 141 0.0563603
'watchguard' 145 0.0213021
'subject:] ' 163 0.0148575
'are' 166 0.629383
'to:' 167 0.214615
'please' 180 0.839247
'from' 184 0.691328
'subject:-' 187 0.218038
'precedence:bulk' 189 0.0429152
'url:com' 196 0.761515
'your' 206 0.758353
'x-mailer:internet mail service (5.5.2653.19)' 210 0.00556242
'proto:http' 239 0.738164
'you' 239 0.650341
'to:skip:w 10' 240 0.370886
'content-type:text' 246 0.610023
'this' 281 0.655698
'content-type:charset' 287 0.33352
'content-type:plain' 306 0.177419
'return-path:skip:w 10' 312 0.038085
'from:email addr:$FIRM>' 318 0.00825756
'from:skip:w 10' 322 0.0214323
'header:MIME-Version:1' 322 0.346045
'header:Return-Path:1' 331 0.685963
'header:Message-ID:1' 358 0.298295
'content-type:text/plain' 456 0.272913
-> <stat> Ham scores for this pair: 361 items; mean 38.74; sdev 7.88
-> <stat> min 17.2567; median 39.624; max 63.0457
-> <stat> percentiles: 5% 24.5112; 25% 33.4889; 75% 44.0288; 95% 49.719
* = 1 items
0.0 0
0.5 0
1.0 0
1.5 0
2.0 0
2.5 0
3.0 0
3.5 0
4.0 0
4.5 0
5.0 0
5.5 0
6.0 0
6.5 0
7.0 0
7.5 0
8.0 0
8.5 0
9.0 0
9.5 0
10.0 0
10.5 0
11.0 0
11.5 0
12.0 0
12.5 0
13.0 0
13.5 0
14.0 0
14.5 0
15.0 0
15.5 0
16.0 0
16.5 0
17.0 5 *****
17.5 0
18.0 1 *
18.5 3 ***
19.0 0
19.5 2 **
20.0 0
20.5 0
21.0 0
21.5 1 *
22.0 0
22.5 0
23.0 3 ***
23.5 2 **
24.0 1 *
24.5 2 **
25.0 2 **
25.5 2 **
26.0 3 ***
26.5 2 **
27.0 4 ****
27.5 3 ***
28.0 0
28.5 1 *
29.0 5 *****
29.5 6 ******
30.0 4 ****
30.5 4 ****
31.0 2 **
31.5 5 *****
32.0 8 ********
32.5 9 *********
33.0 11 ***********
33.5 6 ******
34.0 6 ******
34.5 11 ***********
35.0 4 ****
35.5 3 ***
36.0 6 ******
36.5 7 *******
37.0 5 *****
37.5 7 *******
38.0 6 ******
38.5 11 ***********
39.0 14 **************
39.5 13 *************
40.0 7 *******
40.5 8 ********
41.0 7 *******
41.5 12 ************
42.0 15 ***************
42.5 8 ********
43.0 14 **************
43.5 6 ******
44.0 14 **************
44.5 9 *********
45.0 15 ***************
45.5 10 **********
46.0 2 **
46.5 4 ****
47.0 6 ******
47.5 7 *******
48.0 2 **
48.5 3 ***
49.0 3 ***
49.5 1 *
50.0 2 **
50.5 1 *
51.0 2 **
51.5 2 **
52.0 0
52.5 1 *
53.0 0
53.5 1 *
54.0 1 *
54.5 1 *
55.0 2 **
55.5 0
56.0 0
56.5 1 *
57.0 0
57.5 1 *
58.0 0
58.5 1 *
59.0 0
59.5 0
60.0 1 *
60.5 0
61.0 0
61.5 0
62.0 0
62.5 0
63.0 1 *
63.5 0
64.0 0
64.5 0
65.0 0
65.5 0
66.0 0
66.5 0
67.0 0
67.5 0
68.0 0
68.5 0
69.0 0
69.5 0
70.0 0
70.5 0
71.0 0
71.5 0
72.0 0
72.5 0
73.0 0
73.5 0
74.0 0
74.5 0
75.0 0
75.5 0
76.0 0
76.5 0
77.0 0
77.5 0
78.0 0
78.5 0
79.0 0
79.5 0
80.0 0
80.5 0
81.0 0
81.5 0
82.0 0
82.5 0
83.0 0
83.5 0
84.0 0
84.5 0
85.0 0
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 0
89.0 0
89.5 0
90.0 0
90.5 0
91.0 0
91.5 0
92.0 0
92.5 0
93.0 0
93.5 0
94.0 0
94.5 0
95.0 0
95.5 0
96.0 0
96.5 0
97.0 0
97.5 0
98.0 0
98.5 0
99.0 0
99.5 0
-> <stat> Spam scores for this pair:
-> <stat> Ham scores for all in this training set: 361 items; mean 38.74; sdev 7.88
-> <stat> min 17.2567; median 39.624; max 63.0457
-> <stat> percentiles: 5% 24.5112; 25% 33.4889; 75% 44.0288; 95% 49.719
* = 1 items
0.0 0
0.5 0
1.0 0
1.5 0
2.0 0
2.5 0
3.0 0
3.5 0
4.0 0
4.5 0
5.0 0
5.5 0
6.0 0
6.5 0
7.0 0
7.5 0
8.0 0
8.5 0
9.0 0
9.5 0
10.0 0
10.5 0
11.0 0
11.5 0
12.0 0
12.5 0
13.0 0
13.5 0
14.0 0
14.5 0
15.0 0
15.5 0
16.0 0
16.5 0
17.0 5 *****
17.5 0
18.0 1 *
18.5 3 ***
19.0 0
19.5 2 **
20.0 0
20.5 0
21.0 0
21.5 1 *
22.0 0
22.5 0
23.0 3 ***
23.5 2 **
24.0 1 *
24.5 2 **
25.0 2 **
25.5 2 **
26.0 3 ***
26.5 2 **
27.0 4 ****
27.5 3 ***
28.0 0
28.5 1 *
29.0 5 *****
29.5 6 ******
30.0 4 ****
30.5 4 ****
31.0 2 **
31.5 5 *****
32.0 8 ********
32.5 9 *********
33.0 11 ***********
33.5 6 ******
34.0 6 ******
34.5 11 ***********
35.0 4 ****
35.5 3 ***
36.0 6 ******
36.5 7 *******
37.0 5 *****
37.5 7 *******
38.0 6 ******
38.5 11 ***********
39.0 14 **************
39.5 13 *************
40.0 7 *******
40.5 8 ********
41.0 7 *******
41.5 12 ************
42.0 15 ***************
42.5 8 ********
43.0 14 **************
43.5 6 ******
44.0 14 **************
44.5 9 *********
45.0 15 ***************
45.5 10 **********
46.0 2 **
46.5 4 ****
47.0 6 ******
47.5 7 *******
48.0 2 **
48.5 3 ***
49.0 3 ***
49.5 1 *
50.0 2 **
50.5 1 *
51.0 2 **
51.5 2 **
52.0 0
52.5 1 *
53.0 0
53.5 1 *
54.0 1 *
54.5 1 *
55.0 2 **
55.5 0
56.0 0
56.5 1 *
57.0 0
57.5 1 *
58.0 0
58.5 1 *
59.0 0
59.5 0
60.0 1 *
60.5 0
61.0 0
61.5 0
62.0 0
62.5 0
63.0 1 *
63.5 0
64.0 0
64.5 0
65.0 0
65.5 0
66.0 0
66.5 0
67.0 0
67.5 0
68.0 0
68.5 0
69.0 0
69.5 0
70.0 0
70.5 0
71.0 0
71.5 0
72.0 0
72.5 0
73.0 0
73.5 0
74.0 0
74.5 0
75.0 0
75.5 0
76.0 0
76.5 0
77.0 0
77.5 0
78.0 0
78.5 0
79.0 0
79.5 0
80.0 0
80.5 0
81.0 0
81.5 0
82.0 0
82.5 0
83.0 0
83.5 0
84.0 0
84.5 0
85.0 0
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 0
89.0 0
89.5 0
90.0 0
90.5 0
91.0 0
91.5 0
92.0 0
92.5 0
93.0 0
93.5 0
94.0 0
94.5 0
95.0 0
95.5 0
96.0 0
96.5 0
97.0 0
97.5 0
98.0 0
98.5 0
99.0 0
99.5 0
-> <stat> Spam scores for all in this training set:
-> <stat> Ham scores for all runs: 482 items; mean 38.01; sdev 9.62
-> <stat> min 17.1664; median 39.2174; max 71.2362
-> <stat> percentiles: 5% 21.5969; 25% 31.7042; 75% 44.2002; 95% 51.8263
* = 1 items
0.0 0
0.5 0
1.0 0
1.5 0
2.0 0
2.5 0
3.0 0
3.5 0
4.0 0
4.5 0
5.0 0
5.5 0
6.0 0
6.5 0
7.0 0
7.5 0
8.0 0
8.5 0
9.0 0
9.5 0
10.0 0
10.5 0
11.0 0
11.5 0
12.0 0
12.5 0
13.0 0
13.5 0
14.0 0
14.5 0
15.0 0
15.5 0
16.0 0
16.5 0
17.0 6 ******
17.5 0
18.0 1 *
18.5 5 *****
19.0 0
19.5 4 ****
20.0 4 ****
20.5 2 **
21.0 1 *
21.5 6 ******
22.0 5 *****
22.5 3 ***
23.0 5 *****
23.5 4 ****
24.0 4 ****
24.5 3 ***
25.0 3 ***
25.5 4 ****
26.0 9 *********
26.5 6 ******
27.0 4 ****
27.5 6 ******
28.0 2 **
28.5 2 **
29.0 6 ******
29.5 9 *********
30.0 4 ****
30.5 5 *****
31.0 4 ****
31.5 6 ******
32.0 10 **********
32.5 9 *********
33.0 11 ***********
33.5 7 *******
34.0 6 ******
34.5 11 ***********
35.0 6 ******
35.5 4 ****
36.0 7 *******
36.5 7 *******
37.0 7 *******
37.5 7 *******
38.0 9 *********
38.5 11 ***********
39.0 14 **************
39.5 13 *************
40.0 9 *********
40.5 9 *********
41.0 8 ********
41.5 12 ************
42.0 19 *******************
42.5 9 *********
43.0 17 *****************
43.5 9 *********
44.0 16 ****************
44.5 10 **********
45.0 17 *****************
45.5 11 ***********
46.0 5 *****
46.5 4 ****
47.0 6 ******
47.5 9 *********
48.0 4 ****
48.5 4 ****
49.0 5 *****
49.5 1 *
50.0 4 ****
50.5 4 ****
51.0 2 **
51.5 2 **
52.0 1 *
52.5 1 *
53.0 0
53.5 1 *
54.0 2 **
54.5 1 *
55.0 4 ****
55.5 0
56.0 0
56.5 1 *
57.0 0
57.5 2 **
58.0 0
58.5 2 **
59.0 0
59.5 0
60.0 2 **
60.5 1 *
61.0 0
61.5 0
62.0 0
62.5 0
63.0 1 *
63.5 0
64.0 0
64.5 0
65.0 0
65.5 0
66.0 0
66.5 0
67.0 0
67.5 0
68.0 2 **
68.5 1 *
69.0 0
69.5 0
70.0 1 *
70.5 0
71.0 1 *
71.5 0
72.0 0
72.5 0
73.0 0
73.5 0
74.0 0
74.5 0
75.0 0
75.5 0
76.0 0
76.5 0
77.0 0
77.5 0
78.0 0
78.5 0
79.0 0
79.5 0
80.0 0
80.5 0
81.0 0
81.5 0
82.0 0
82.5 0
83.0 0
83.5 0
84.0 0
84.5 0
85.0 0
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 0
89.0 0
89.5 0
90.0 0
90.5 0
91.0 0
91.5 0
92.0 0
92.5 0
93.0 0
93.5 0
94.0 0
94.5 0
95.0 0
95.5 0
96.0 0
96.5 0
97.0 0
97.5 0
98.0 0
98.5 0
99.0 0
99.5 0
-> <stat> Spam scores for all runs: 23 items; mean 73.88; sdev 5.94
-> <stat> min 62.9927; median 74.0114; max 82.6517
-> <stat> percentiles: 5% 64.6143; 25% 70.2017; 75% 78.8789; 95% 82.2079
* = 1 items
0.0 0
0.5 0
1.0 0
1.5 0
2.0 0
2.5 0
3.0 0
3.5 0
4.0 0
4.5 0
5.0 0
5.5 0
6.0 0
6.5 0
7.0 0
7.5 0
8.0 0
8.5 0
9.0 0
9.5 0
10.0 0
10.5 0
11.0 0
11.5 0
12.0 0
12.5 0
13.0 0
13.5 0
14.0 0
14.5 0
15.0 0
15.5 0
16.0 0
16.5 0
17.0 0
17.5 0
18.0 0
18.5 0
19.0 0
19.5 0
20.0 0
20.5 0
21.0 0
21.5 0
22.0 0
22.5 0
23.0 0
23.5 0
24.0 0
24.5 0
25.0 0
25.5 0
26.0 0
26.5 0
27.0 0
27.5 0
28.0 0
28.5 0
29.0 0
29.5 0
30.0 0
30.5 0
31.0 0
31.5 0
32.0 0
32.5 0
33.0 0
33.5 0
34.0 0
34.5 0
35.0 0
35.5 0
36.0 0
36.5 0
37.0 0
37.5 0
38.0 0
38.5 0
39.0 0
39.5 0
40.0 0
40.5 0
41.0 0
41.5 0
42.0 0
42.5 0
43.0 0
43.5 0
44.0 0
44.5 0
45.0 0
45.5 0
46.0 0
46.5 0
47.0 0
47.5 0
48.0 0
48.5 0
49.0 0
49.5 0
50.0 0
50.5 0
51.0 0
51.5 0
52.0 0
52.5 0
53.0 0
53.5 0
54.0 0
54.5 0
55.0 0
55.5 0
56.0 0
56.5 0
57.0 0
57.5 0
58.0 0
58.5 0
59.0 0
59.5 0
60.0 0
60.5 0
61.0 0
61.5 0
62.0 0
62.5 1 *
63.0 0
63.5 0
64.0 0
64.5 2 **
65.0 1 *
65.5 0
66.0 0
66.5 1 *
67.0 0
67.5 0
68.0 0
68.5 0
69.0 0
69.5 0
70.0 2 **
70.5 0
71.0 0
71.5 1 *
72.0 0
72.5 0
73.0 2 **
73.5 1 *
74.0 2 **
74.5 0
75.0 1 *
75.5 1 *
76.0 0
76.5 0
77.0 0
77.5 0
78.0 1 *
78.5 2 **
79.0 0
79.5 0
80.0 1 *
80.5 1 *
81.0 1 *
81.5 0
82.0 1 *
82.5 1 *
83.0 0
83.5 0
84.0 0
84.5 0
85.0 0
85.5 0
86.0 0
86.5 0
87.0 0
87.5 0
88.0 0
88.5 0
89.0 0
89.5 0
90.0 0
90.5 0
91.0 0
91.5 0
92.0 0
92.5 0
93.0 0
93.5 0
94.0 0
94.5 0
95.0 0
95.5 0
96.0 0
96.5 0
97.0 0
97.5 0
98.0 0
98.5 0
99.0 0
99.5 0
-> best cost for all runs: $2.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 4 cutoff pairs
-> smallest ham & spam cutoffs 0.61 & 0.715
-> fp 0; fn 0; unsure ham 6; unsure spam 7
-> fp rate 0%; fn rate 0%; unsure rate 2.57%
-> largest ham & spam cutoffs 0.625 & 0.715
-> fp 0; fn 0; unsure ham 6; unsure spam 7
-> fp rate 0%; fn rate 0%; unsure rate 2.57%
-> <stat> all runs false positives: 14
-> <stat> all runs false negatives: 0
-> <stat> all runs unsure: 0
-> <stat> all runs false positive %: 2.90456431535
-> <stat> all runs false negative %: 0.0
-> <stat> all runs unsure %: 0.0
-> <stat> all runs cost: $140.00
The f-ps are conference announcements, solicited commercial email, or
listserv responses.
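With ham_cutoff and spam_cutoff both at 0.560 there is no "unsure" band, so any ham scoring at or above 0.56 is counted as an f-p. The sketch below shows how B's nine f-ps would be relabelled under the suggested pair 0.61 & 0.715. The rule in `classify` (ham below ham_cutoff, spam at or above spam_cutoff, unsure in between) is an assumption that happens to agree with the counts in the output, not something lifted from the TestDriver source, and the scores are lower bucket edges read off B's ham histogram rather than exact values.

```python
from collections import Counter

def classify(score, ham_cutoff, spam_cutoff):
    """Three-way decision; the boundary handling here is an assumption."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

# Lower bucket edges (as fractions) of B's nine hams that scored 0.56 or higher.
B_HIGH_HAMS = [0.575, 0.585, 0.600, 0.605, 0.680, 0.680, 0.685, 0.700, 0.710]

def tally(scores, ham_cutoff, spam_cutoff):
    """Count how many scores land in each of the three classes."""
    return Counter(classify(s, ham_cutoff, spam_cutoff) for s in scores)

print(tally(B_HIGH_HAMS, 0.56, 0.56))    # single cutoff, as in this run
print(tally(B_HIGH_HAMS, 0.61, 0.715))   # suggested cutoff pair
```

Under the single 0.56 cutoff all nine count as spam. Under the pair, four fall back to ham and five become unsure, matching the "unsure ham 5" line for B.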
Should I set the cutoff to 0.63? Do I owe Tim and Gary $140? Sorry I
can't answer these questions myself, but I've been lucky to skim subject
headers on this list lately so I don't know what all this new-fangled
stuff is. I realize the data is less than ideal, but it's all I can get
at the moment.
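On the cutoff question, the "best cost" lines in the output can be reproduced with a brute-force search over the histogram buckets. This is a rough sketch of that idea, not the actual compute_best_cutoffs_from_histograms code. It uses B's non-empty buckets from the histograms above, works in bucket units (score times 100), keeps costs in cents to avoid float fuzz, and assumes the same boundary rule as the output suggests (ham below ham_cutoff, spam at or above spam_cutoff). The names `cost_in_cents` and `best_cutoff_pairs` are made up.

```python
# Non-empty buckets from B's histograms: lower bucket edge (score * 100) -> count.
HAM = {
    17.0: 1, 18.5: 2, 19.5: 2, 20.0: 4, 20.5: 2, 21.0: 1, 21.5: 5, 22.0: 5,
    22.5: 3, 23.0: 2, 23.5: 2, 24.0: 3, 24.5: 1, 25.0: 1, 25.5: 2, 26.0: 6,
    26.5: 4, 27.5: 3, 28.0: 2, 28.5: 1, 29.0: 1, 29.5: 3, 30.5: 1, 31.0: 2,
    31.5: 1, 32.0: 2, 33.5: 1, 35.0: 2, 35.5: 1, 36.0: 1, 37.0: 2, 38.0: 3,
    40.0: 2, 40.5: 1, 41.0: 1, 42.0: 4, 42.5: 1, 43.0: 3, 43.5: 3, 44.0: 2,
    44.5: 1, 45.0: 2, 45.5: 1, 46.0: 3, 47.5: 2, 48.0: 2, 48.5: 1, 49.0: 2,
    50.0: 2, 50.5: 3, 52.0: 1, 54.0: 1, 55.0: 2, 57.5: 1, 58.5: 1, 60.0: 1,
    60.5: 1, 68.0: 2, 68.5: 1, 70.0: 1, 71.0: 1,
}
SPAM = {
    62.5: 1, 64.5: 2, 65.0: 1, 66.5: 1, 70.0: 2, 71.5: 1, 73.0: 2, 73.5: 1,
    74.0: 2, 75.0: 1, 75.5: 1, 78.0: 1, 78.5: 2, 80.0: 1, 80.5: 1, 81.0: 1,
    82.0: 1, 82.5: 1,
}

# Per-fp, per-fn and per-unsure costs from the output, in cents.
FP_COST, FN_COST, UNSURE_COST = 1000, 100, 20

def cost_in_cents(ham_cutoff, spam_cutoff):
    """Cost of one cutoff pair, with cutoffs in bucket units (0..100)."""
    fp = sum(n for edge, n in HAM.items() if edge >= spam_cutoff)
    fn = sum(n for edge, n in SPAM.items() if edge < ham_cutoff)
    unsure = sum(n for hist in (HAM, SPAM) for edge, n in hist.items()
                 if ham_cutoff <= edge < spam_cutoff)
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST

def best_cutoff_pairs():
    """Try every bucket-edge pair with ham_cutoff <= spam_cutoff."""
    edges = [i * 0.5 for i in range(201)]   # 0.0, 0.5, ..., 100.0
    best, pairs = None, []
    for ham_cutoff in edges:
        for spam_cutoff in edges:
            if spam_cutoff < ham_cutoff:
                continue
            cost = cost_in_cents(ham_cutoff, spam_cutoff)
            if best is None or cost < best:
                best, pairs = cost, [(ham_cutoff, spam_cutoff)]
            elif cost == best:
                pairs.append((ham_cutoff, spam_cutoff))
    return best, pairs

if __name__ == "__main__":
    best, pairs = best_cutoff_pairs()
    print("best cost: $%.2f at %d cutoff pairs" % (best / 100.0, len(pairs)))
    print("smallest: %r  largest: %r" % (pairs[0], pairs[-1]))
```

On B's data this lands on the same four pairs the driver reports, ham cutoffs 61.0 through 62.5 with a spam cutoff of 71.5 (0.61 to 0.625 and 0.715 as fractions), at a cost of $2.40. The single-cutoff setting used in this run, `cost_in_cents(56.0, 56.0)`, comes out at 9000 cents, the $90.00 shown for B.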
Aside from cleaning the training data, what should I do next?
Neale