[Spambayes] Bland word only score..
Rob Hooft
rob@hooft.net
Sun, 13 Oct 2002 18:06:55 +0200
[Tim: the previous copy of this message I sent to you was too quick.]
Tim Peters wrote:
> It's been my belief that bland words are at best worthless as clues,
> and at worst actively hurt (experiment: fiddle your favorite scheme
> to look *only* at the bland words; do they have predictive power?).
Just for kicks: yes. With the latest scheme and (S-H+1)/2, it does give
about a third of a standard deviation of separation on my sets. And the
best part is: it doesn't have any false positives :-P
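
For readers unfamiliar with the (S-H+1)/2 score, here is a minimal sketch of chi-squared combining as it is commonly described. The function names `chi2Q` and `combined_score` are chosen for illustration and are an assumption, not a copy of the code used for this run. Word probabilities exactly equal to 0 or 1 are assumed not to occur.

```python
import math

def chi2Q(x2, v):
    """Probability that a chi-squared variable with v degrees of
    freedom exceeds x2. Only valid for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)

def combined_score(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_not_spam, 2 * n)  # spam indicator
    H = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)      # ham indicator
    return (S - H + 1.0) / 2.0
```

With only bland words, every probability sits near 0.5, so S and H come out nearly equal and the score stays close to 0.5, which is what the histograms below show.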
[Classifier]
use_chi_squared_combining: True
robinson_minimum_prob_strength: 0.1
[TestDriver]
spam_cutoff: 0.70
nbuckets: 200
best_cutoff_fp_weight: 10
Obviously, for this run the robinson_minimum_prob_strength test is inverted
in the code, so that only the bland words are scored.
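
A minimal sketch of what inverting that test could look like, assuming the usual rule keeps a word only when its probability is at least the minimum strength away from 0.5. The function `select_clues` and its `bland_only` flag are hypothetical, introduced here only for illustration.

```python
def select_clues(word_probs, min_strength=0.1, bland_only=False):
    """Pick the words used for scoring.

    Normal mode keeps words with |p - 0.5| >= min_strength.
    With bland_only=True the test is inverted and only the
    words close to 0.5 are kept.
    """
    selected = {}
    for word, p in word_probs.items():
        strong = abs(p - 0.5) >= min_strength
        if strong != bland_only:
            selected[word] = p
    return selected

# Made-up probabilities, purely for illustration.
probs = {"the": 0.52, "viagra": 0.99, "python": 0.02, "and": 0.45}
normal = select_clues(probs)
bland = select_clues(probs, bland_only=True)
```

Here `normal` holds the two strong words and `bland` holds the two words within 0.1 of 0.5.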
-> <stat> Ham scores for all runs: 16000 items; mean 49.59; sdev 1.41
-> <stat> min 40.7953; median 49.9561; max 57.7839
40.0 0
40.5 1 *
41.0 0
41.5 1 *
42.0 9 *
42.5 8 *
43.0 17 *
43.5 31 *
44.0 35 *
44.5 61 *
45.0 95 **
45.5 136 **
46.0 186 ***
46.5 317 *****
47.0 383 ******
47.5 572 ********
48.0 832 ***********
48.5 1101 ***************
49.0 1455 ********************
49.5 3829 ***************************************************
50.0 4625 *************************************************************
50.5 1024 **************
51.0 520 *******
51.5 275 ****
52.0 176 ***
52.5 108 **
53.0 71 *
53.5 66 *
54.0 30 *
54.5 16 *
55.0 10 *
55.5 4 *
56.0 3 *
56.5 2 *
57.0 0
57.5 1 *
58.0 0
58.5 0
59.0 0
59.5 0
60.0 0
-> <stat> Spam scores for all runs: 5800 items; mean 50.39; sdev 1.25
-> <stat> min 43.2803; median 50.2241; max 59.1799
40.0 0
40.5 0
41.0 0
41.5 0
42.0 0
42.5 0
43.0 1 *
43.5 2 *
44.0 1 *
44.5 4 *
45.0 8 *
45.5 12 *
46.0 30 *
46.5 38 *
47.0 53 **
47.5 65 **
48.0 94 ***
48.5 118 ***
49.0 206 *****
49.5 497 ************
50.0 2580 ************************************************************
50.5 925 **********************
51.0 493 ************
51.5 234 ******
52.0 135 ****
52.5 88 ***
53.0 95 ***
53.5 43 *
54.0 22 *
54.5 22 *
55.0 17 *
55.5 7 *
56.0 3 *
56.5 2 *
57.0 1 *
57.5 1 *
58.0 0
58.5 0
59.0 3 *
59.5 0
60.0 0
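
The message does not say how the "third of a standard deviation" was computed. One reading that is consistent with the statistics reported above is the gap between the means divided by the sum of the two standard deviations. The sketch below works that out, with the gap over the average standard deviation shown for comparison. The function name `separation` is an assumption.

```python
def separation(ham_mean, ham_sdev, spam_mean, spam_sdev):
    """Return the gap between the means, expressed relative to
    (a) the sum and (b) the average of the two standard deviations."""
    gap = spam_mean - ham_mean
    over_sum = gap / (ham_sdev + spam_sdev)
    over_avg = gap / ((ham_sdev + spam_sdev) / 2.0)
    return over_sum, over_avg

# Values taken from the <stat> lines in this message.
over_sum, over_avg = separation(49.59, 1.41, 50.39, 1.25)
```

With these numbers `over_sum` is about 0.30 and `over_avg` is about 0.60.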
-> best cutoff for all runs: 0.58
-> with weighted total 10*0 fp + 5797 fn = 5797
-> fp rate 0% fn rate 99.9%
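
The arithmetic behind these three lines can be checked with a short sketch, assuming the weighted total is fp_weight * fp + fn with best_cutoff_fp_weight: 10 as set above. The function name `cutoff_cost` is hypothetical.

```python
def cutoff_cost(fp, fn, n_ham, n_spam, fp_weight=10):
    """Return (weighted total, fp rate, fn rate) for one cutoff."""
    total = fp_weight * fp + fn
    return total, fp / n_ham, fn / n_spam

# Counts from this message: 16000 ham, 5800 spam, 0 fp, 5797 fn.
total, fp_rate, fn_rate = cutoff_cost(fp=0, fn=5797, n_ham=16000, n_spam=5800)
```

At a cutoff of 0.58 only the 3 spam in the 59.0 bucket are caught, which leaves 5797 of 5800 as false negatives.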
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/