[Spambayes] Bland word only score..

Rob Hooft rob@hooft.net
Sun, 13 Oct 2002 18:06:55 +0200


[Tim: the previous copy of this message I sent to you was too quick.]

Tim Peters wrote:

 > It's been my belief that bland words are at best worthless as clues,
 > and at worst actively hurt (experiment:  fiddle your favorite scheme
 > to look *only* at the bland words; do they have predictive power?).

Just for kicks: yes, they do. With the latest scheme and (S-H+1)/2, the
bland words alone give about a third of a standard deviation of
separation on my sets. And the best part: there are no false positives :-P
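
For readers who have not seen the (S-H+1)/2 score: below is a minimal
sketch of chi-squared combining as I understand it. The names chi2q and
chi_combine are mine, not necessarily those in the classifier, and the
sketch skips the underflow protection a real implementation would need.
It assumes every word probability lies strictly between 0 and 1.

```python
import math

def chi2q(x2, dof):
    """Chi-squared survival function, valid for even degrees of freedom only."""
    assert dof > 0 and dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1].

    Assumption: every p is strictly between 0 and 1, so the logs are finite.
    """
    n = len(probs)
    if n == 0:
        return 0.5
    ln_p = sum(math.log(p) for p in probs)
    ln_1mp = sum(math.log(1.0 - p) for p in probs)
    S = 1.0 - chi2q(-2.0 * ln_1mp, 2 * n)  # evidence for spam
    H = 1.0 - chi2q(-2.0 * ln_p, 2 * n)    # evidence for ham
    return (S - H + 1.0) / 2.0

# Made-up probabilities: strong clues push the score to an extreme,
# bland clues leave it close to 0.5.
print(chi_combine([0.99] * 10))
print(chi_combine([0.45, 0.55, 0.50, 0.52]))
```

With only bland clues, S and H come out nearly equal, which fits the
scores below piling up around 50.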

[Classifier]
use_chi_squared_combining: True
robinson_minimum_prob_strength: 0.1

[TestDriver]
spam_cutoff: 0.70

nbuckets: 200
best_cutoff_fp_weight: 10

Obviously, for this experiment the robinson_minimum_prob_strength test is
inverted in the code, so that only the bland words are looked at.
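
A sketch of what inverting the strength test amounts to. The function
select_clues and its invert flag are hypothetical, made up for this
illustration; whether the real comparison is inclusive at the boundary is
also an assumption.

```python
def select_clues(word_probs, min_strength=0.1, invert=False):
    """Return the (word, prob) pairs that pass the strength test.

    Normal test:   keep words with |p - 0.5| >= min_strength.
    Inverted test: keep only the bland words, |p - 0.5| < min_strength.
    """
    kept = []
    for word, p in word_probs.items():
        strong = abs(p - 0.5) >= min_strength
        if strong != invert:
            kept.append((word, p))
    return kept

# Made-up word probabilities, for illustration only.
probs = {"viagra": 0.99, "the": 0.51, "python": 0.02, "with": 0.45}
print(select_clues(probs))               # strong clues only
print(select_clues(probs, invert=True))  # bland clues only
```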

-> <stat> Ham scores for all runs: 16000 items; mean 49.59; sdev 1.41
-> <stat> min 40.7953; median 49.9561; max 57.7839

40.0    0
40.5    1 *
41.0    0
41.5    1 *
42.0    9 *
42.5    8 *
43.0   17 *
43.5   31 *
44.0   35 *
44.5   61 *
45.0   95 **
45.5  136 **
46.0  186 ***
46.5  317 *****
47.0  383 ******
47.5  572 ********
48.0  832 ***********
48.5 1101 ***************
49.0 1455 ********************
49.5 3829 ***************************************************
50.0 4625 *************************************************************
50.5 1024 **************
51.0  520 *******
51.5  275 ****
52.0  176 ***
52.5  108 **
53.0   71 *
53.5   66 *
54.0   30 *
54.5   16 *
55.0   10 *
55.5    4 *
56.0    3 *
56.5    2 *
57.0    0
57.5    1 *
58.0    0
58.5    0
59.0    0
59.5    0
60.0    0

-> <stat> Spam scores for all runs: 5800 items; mean 50.39; sdev 1.25
-> <stat> min 43.2803; median 50.2241; max 59.1799

40.0    0
40.5    0
41.0    0
41.5    0
42.0    0
42.5    0
43.0    1 *
43.5    2 *
44.0    1 *
44.5    4 *
45.0    8 *
45.5   12 *
46.0   30 *
46.5   38 *
47.0   53 **
47.5   65 **
48.0   94 ***
48.5  118 ***
49.0  206 *****
49.5  497 ************
50.0 2580 ************************************************************
50.5  925 **********************
51.0  493 ************
51.5  234 ******
52.0  135 ****
52.5   88 ***
53.0   95 ***
53.5   43 *
54.0   22 *
54.5   22 *
55.0   17 *
55.5    7 *
56.0    3 *
56.5    2 *
57.0    1 *
57.5    1 *
58.0    0
58.5    0
59.0    3 *
59.5    0
60.0    0
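
One way to arrive at "a third of a standard deviation" from the two
<stat> lines is to divide the difference of the means by the sum of the
two standard deviations. That this is the measure meant is an assumption;
dividing by a single sdev gives a larger figure.

```python
def separation(ham_mean, ham_sdev, spam_mean, spam_sdev):
    """Difference of the means in units of the summed standard deviations."""
    return (spam_mean - ham_mean) / (ham_sdev + spam_sdev)

# Figures taken from the <stat> lines above.
sep = separation(49.59, 1.41, 50.39, 1.25)
print(f"separation: {sep:.2f}")  # prints "separation: 0.30"
```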

-> best cutoff for all runs: 0.58
->     with weighted total 10*0 fp + 5797 fn = 5797
->     fp rate 0%  fn rate 99.9%
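
The weighted total is just best_cutoff_fp_weight times the false
positives plus the false negatives. A rough sketch of such a cutoff
search follows; best_cutoff and weighted_total are hypothetical helper
names, the tie-breaking (lowest cutoff wins) is an assumption, and the toy
scores are made up.

```python
def weighted_total(fp, fn, fp_weight=10):
    """Cost of a cutoff: each false positive counts fp_weight times."""
    return fp_weight * fp + fn

def best_cutoff(ham_scores, spam_scores, nbuckets=200, fp_weight=10):
    """Try every bucket boundary in [0, 1] and return the cheapest one.

    Returns (cutoff, cost, fp, fn). A message counts as spam when its
    score is >= cutoff.
    """
    best = None
    for i in range(nbuckets + 1):
        cutoff = i / nbuckets
        fp = sum(1 for s in ham_scores if s >= cutoff)
        fn = sum(1 for s in spam_scores if s < cutoff)
        cost = weighted_total(fp, fn, fp_weight)
        if best is None or cost < best[1]:
            best = (cutoff, cost, fp, fn)
    return best

# The arithmetic reported above: 10*0 fp + 5797 fn out of 5800 spam.
print(weighted_total(0, 5797))
print(round(100.0 * 5797 / 5800, 1))

# Toy scores on a 0-1 scale, for illustration only.
print(best_cutoff([0.40, 0.50, 0.57], [0.50, 0.59]))
```

With false positives weighted 10 to 1 and the two distributions almost on
top of each other, the cheapest cutoff is the one that calls nearly
everything ham, hence the 99.9% fn rate.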


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/