[Spambayes] Only one .. some #'s
Brad Clements
bkc@murkworks.com
Sat, 28 Sep 2002 15:04:45 -0400
Sorry, I missed the "only one" shoot-out.
I've updated my source tree and re-ran timcv using mostly default settings. I then compared
those results with my previous results.
Note that I still have some spam in my ham; I haven't changed my corpus since
starting the tests.
Executive summary: the algorithm is getting better every day. Total unique false positives
fell from 50 to 33 (-34%), while false negatives stayed roughly flat (238 to 243, +2.1%).
So comparing (Old)
[TestDriver]
save_trained_pickles = True
show_histograms = True
show_ham_lo = 1.0
show_best_discriminators = 30
show_spam_lo = 0.0
show_ham_hi = 0.0
show_false_positives = True
pickle_basename = class
show_false_negatives = True
spam_cutoff = 0.575
nbuckets = 40
show_spam_hi = 0.45
show_charlimit = 3000
[Classifier]
min_spamprob = 0.01
use_robinson_combining = True
hambias = 2.0
use_robinson_ranking = False
spambias = 1.0
robinson_probability_x = 0.5
use_robinson_probability = True
robinson_minimum_prob_strength = 0.1
unknown_spamprob = 0.5
max_discriminators = 1500
robinson_probability_a = 1.0
max_spamprob = 0.99
[Tokenizer]
ignore_redundant_html = False
mine_received_headers = True
count_all_header_lines = False
retain_pure_html_tags = False
basic_header_tokenize = False
safe_headers = abuse-reports-to
with (New)
[TestDriver]
pickle_basename = class
save_trained_pickles = True
show_histograms = True
show_ham_lo = 1.0
show_best_discriminators = 30
nbuckets = 40
show_ham_hi = 0.0
spam_cutoff = 0.575
spam_directories = Data/Spam/Set%d
show_spam_lo = 0.0
show_false_negatives = True
ham_directories = Data/Ham/Set%d
compute_best_cutoffs_from_histograms = True
show_false_positives = True
best_cutoff_fp_weight = 1
show_spam_hi = 0.45
save_histogram_pickles = True
show_charlimit = 3000
[Classifier]
count_duplicates_only_once_in_training = True
robinson_probability_x = 0.5
robinson_minimum_prob_strength = 0.1
robinson_probability_s = 0.45
max_discriminators = 150
use_central_limit2 = False
use_central_limit = False
[Tokenizer]
mine_received_headers = True
octet_prefix_size = 5
count_all_header_lines = False
check_octets = False
ignore_redundant_html = False
mine_message_ids = False
basic_header_tokenize = False
safe_headers = abuse-reports-to
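The two option dumps above are easier to compare mechanically than by eye. What follows is only a rough sketch, not part of the spambayes tools: it parses two short excerpts copied from the listings with the standard-library configparser and reports options whose values differ or that appear in only one dump. The names OLD, NEW, load and diff are made up for illustration, and an option missing from one dump may simply have been renamed or dropped between versions.

```python
import configparser

# Short excerpts copied from the (Old) and (New) dumps above.
OLD = """
[Classifier]
max_discriminators = 1500
robinson_probability_x = 0.5
robinson_probability_a = 1.0
"""

NEW = """
[TestDriver]
spam_directories = Data/Spam/Set%d
[Classifier]
max_discriminators = 150
robinson_probability_x = 0.5
robinson_probability_s = 0.45
"""


def load(text):
    """Return {(section, option): value} for one dump."""
    # interpolation=None keeps values such as "Set%d" literal;
    # the default interpolation would reject the bare "%d".
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {(s, k): v for s in parser.sections() for k, v in parser.items(s)}


def diff(old, new):
    """List (section, option, old_value, new_value) for every difference."""
    changes = []
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes.append((key[0], key[1], before, after))
    return changes


for section, option, before, after in diff(load(OLD), load(NEW)):
    print(f"[{section}] {option}: {before} -> {after}")
```

A value of None on either side means the option is absent from that dump.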
New histograms:
-> <stat> Ham scores for all runs: 13000 items; mean 26.25; sdev 7.31
* = 41 items
0.0 22 *
2.5 17 *
5.0 26 *
7.5 82 **
10.0 87 ***
12.5 260 *******
15.0 476 ************
17.5 913 ***********************
20.0 1734 *******************************************
22.5 2479 *************************************************************
25.0 2288 ********************************************************
27.5 1680 *****************************************
30.0 948 ************************
32.5 557 **************
35.0 473 ************
37.5 428 ***********
40.0 158 ****
42.5 101 ***
45.0 81 **
47.5 59 **
50.0 50 **
52.5 29 *
55.0 19 *
57.5 13 *
60.0 5 *
62.5 5 *
65.0 3 *
67.5 2 *
70.0 2 *
72.5 1 *
75.0 0
77.5 2 *
80.0 0
-> <stat> Spam scores for all runs: 13000 items; mean 80.77; sdev 8.46
* = 38 items
0.0 0
2.5 0
5.0 0
7.5 0
10.0 0
12.5 0
15.0 0
17.5 0
20.0 1 *
22.5 3 *
25.0 2 *
27.5 3 *
30.0 0
32.5 6 *
35.0 37 *
37.5 37 *
40.0 12 *
42.5 8 *
45.0 11 *
47.5 13 *
50.0 15 *
52.5 36 *
55.0 59 **
57.5 65 **
60.0 131 ****
62.5 178 *****
65.0 244 *******
67.5 322 *********
70.0 532 **************
72.5 646 *****************
75.0 900 ************************
77.5 1386 *************************************
80.0 1891 **************************************************
82.5 2300 *************************************************************
85.0 2031 ******************************************************
87.5 1304 ***********************************
90.0 471 *************
92.5 232 *******
95.0 72 **
97.5 52 **
-> best cutoff for all runs: 0.525
-> with weighted total 1*81 fp + 148 fn = 229
-> fp rate 0.623% fn rate 1.14%
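As a sanity check on the three "best cutoff" lines above, here is a small sketch that recomputes the weighted total and the two error rates from the counts shown. It assumes the weight of 1 comes from best_cutoff_fp_weight in the new config and that the rates are taken over the 13000 ham and 13000 spam in the histograms; the variable names are mine, not the test driver's.

```python
# Counts copied from the output above.
n_ham = 13000    # "Ham scores for all runs: 13000 items"
n_spam = 13000   # "Spam scores for all runs: 13000 items"
fp = 81          # false positives at cutoff 0.525
fn = 148         # false negatives at cutoff 0.525
fp_weight = 1    # assumed to be best_cutoff_fp_weight

# Cost the cutoff search reports: weighted fp plus fn.
weighted_total = fp_weight * fp + fn

# Error rates as percentages of each corpus.
fp_rate = 100.0 * fp / n_ham
fn_rate = 100.0 * fn / n_spam

print(f"weighted total {fp_weight}*{fp} fp + {fn} fn = {weighted_total}")
print(f"fp rate {fp_rate:.3f}%  fn rate {fn_rate:.2f}%")
```

Raising fp_weight would make each false positive count for more in the total, which should push the chosen cutoff higher.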
Comparing old (left) to new (right)
false positive percentages
0.538 0.308 won -42.75%
0.308 0.308 tied
0.308 0.231 won -25.00%
0.538 0.308 won -42.75%
0.385 0.231 won -40.00%
0.154 0.077 won -50.00%
0.462 0.231 won -50.00%
0.231 0.154 won -33.33%
0.538 0.385 won -28.44%
0.385 0.308 won -20.00%
won 9 times
tied 1 times
lost 0 times
total unique fp went from 50 to 33 won -34.00%
mean fp % went from 0.384615384615 to 0.253846153846 won -34.00%
false negative percentages
1.615 2.000 lost +23.84%
2.385 2.462 lost +3.23%
2.000 1.769 won -11.55%
1.615 1.692 lost +4.77%
1.769 1.769 tied
1.615 1.615 tied
1.385 1.692 lost +22.17%
1.846 1.615 won -12.51%
2.000 2.000 tied
2.077 2.077 tied
won 2 times
tied 4 times
lost 4 times
total unique fn went from 238 to 243 lost +2.10%
mean fn % went from 1.83076923077 to 1.86923076923 lost +2.10%
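The won/lost/tied columns above look like plain relative changes from the old rate to the new one, where a lower error rate counts as a win. The sketch below reproduces a few of the rows under that assumption; compare is a hypothetical helper, not the comparison script's actual code.

```python
def compare(old, new):
    """Describe the change from old to new, where lower is better."""
    if new == old:
        return "tied"
    change = 100.0 * (new - old) / old  # relative change in percent
    verdict = "won" if new < old else "lost"
    return f"{verdict} {change:+.2f}%"


# Rows copied from the false positive and false negative tables above.
print(compare(0.538, 0.308))  # first fp row
print(compare(0.308, 0.308))  # second fp row
print(compare(1.615, 2.000))  # first fn row
print(compare(50, 33))        # total unique fp
print(compare(238, 243))      # total unique fn
```

Because the change is relative to the old value, small absolute differences on a small base (such as 0.154 to 0.077) show up as large percentages.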
ham mean                     ham sdev
29.46  26.30  -10.73%        7.32  7.57  +3.42%
29.31  26.18  -10.68%        7.24  7.29  +0.69%
29.90  26.61  -11.00%        7.11  7.37  +3.66%
29.83  26.46  -11.30%        7.13  7.38  +3.51%
29.47  26.27  -10.86%        7.23  7.37  +1.94%
29.62  26.41  -10.84%        7.10  7.08  -0.28%
29.36  26.21  -10.73%        7.10  7.23  +1.83%
29.23  26.03  -10.95%        6.91  7.00  +1.30%
29.13  25.97  -10.85%        7.25  7.46  +2.90%
29.32  26.09  -11.02%        6.97  7.31  +4.88%
ham mean and sdev for all runs
29.46  26.25  -10.90%        7.14  7.31  +2.38%
spam mean                    spam sdev
78.63  80.57   +2.47%        7.79  8.42  +8.09%
78.67  80.68   +2.55%        8.25  8.95  +8.48%
78.85  80.83   +2.51%        8.15  8.70  +6.75%
79.00  81.11   +2.67%        7.81  8.22  +5.25%
78.79  80.64   +2.35%        7.77  8.20  +5.53%
79.05  81.05   +2.53%        7.37  7.82  +6.11%
78.52  80.65   +2.71%        7.60  8.26  +8.68%
78.71  80.70   +2.53%        7.99  8.41  +5.26%
79.02  80.99   +2.49%        8.05  8.64  +7.33%
78.46  80.49   +2.59%        8.37  8.87  +5.97%
spam mean and sdev for all runs
78.77  80.77   +2.54%        7.92  8.46  +6.82%
ham/spam mean difference: 49.31 54.52 +5.21
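The last line of the output is just the gap between the all-runs spam mean and the all-runs ham mean, before and after. A minimal sketch using the "for all runs" figures above (variable names are illustrative only):

```python
# All-runs means copied from the tables above: (old, new).
old_ham_mean, new_ham_mean = 29.46, 26.25
old_spam_mean, new_spam_mean = 78.77, 80.77

# Separation between the two score distributions, old and new.
old_gap = round(old_spam_mean - old_ham_mean, 2)
new_gap = round(new_spam_mean - new_ham_mean, 2)
gain = round(new_gap - old_gap, 2)

print(f"ham/spam mean difference: {old_gap:.2f} {new_gap:.2f} {gain:+.2f}")
```

The gap widened from both ends: the ham mean moved down and the spam mean moved up, although both standard deviations also grew slightly.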
Brad Clements, bkc@murkworks.com (315)268-1000
http://www.murkworks.com (315)268-9812 Fax
AOL-IM: BKClements