[Spambayes] A perverse prob strength experiment
Tim Peters
tim.one@comcast.net
Sun, 22 Sep 2002 14:05:14 -0400
What if we ignored all words with spamprob *greater* than 0.1 away from 0.5?
That is, only look at the blandest words. That's an indicator of how much
relevant information is in the bland words. Apart from that, the same as
"the usual" default untweaked Robinson setup, although I had the foresight
<wink> to ask for 1000 histogram buckets:
[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500
[TestDriver]
spam_cutoff: 0.5
nbuckets: 1000
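In code terms, the tweak just inverts the usual discriminator test: keep a
word only if its spamprob lies *within* 0.1 of 0.5. A minimal sketch (not
the actual classifier code -- the function and constant names here are made
up):

```python
# Hypothetical sketch of the "bland words only" filter described above.
BLAND_BAND = 0.1  # keep words whose spamprob is within this of 0.5

def bland_only(word_probs):
    """Filter (word, spamprob) pairs down to the bland ones."""
    return [(w, p) for (w, p) in word_probs
            if abs(p - 0.5) <= BLAND_BAND]

# Example: the strong discriminators are thrown away, the bland words kept.
probs = [('viagra', 0.99), ('the', 0.52), ('python', 0.01),
         ('meeting', 0.45)]
print(bland_only(probs))   # [('the', 0.52), ('meeting', 0.45)]
```

The surviving probabilities then go through the usual Robinson combining
step unchanged.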
There is *some* information in the bland words; with the cutoff at 0.5,
total unique false pos 7372
total unique false neg 4345
average fp % 36.86
average fn % 31.0357142857
The variance in f-p rates across runs was very high:
f-p % f-n %
38.850 29.214
32.500 33.071
33.950 34.929
38.450 30.000
44.050 26.786
38.700 29.071
30.800 34.214
36.750 31.000
36.450 32.286
38.100 29.786
If there were no exploitable information, both error rates would be near
50%. A question is whether the bits of information in the bland words vote
"in the same direction" as the non-bland words. The idea that including the
bland words should do no *harm* assumes that they tend to vote in the same
direction as the non-bland words (which we know very strongly vote in
"the right" direction). I'm not sure that they do! The mean of the spam
distribution is a little higher than the mean of the ham distribution, but
both are far below the values of spam_cutoff that have worked best overall
by all reports so far:
Ham distribution for all runs:
20000 items; mean 49.76; sample sdev 0.70
* = 21 items
45.90 0
46.00 1 *
46.10 0
46.20 0
46.30 0
46.40 0
46.50 1 *
46.60 0
46.70 3 *
46.80 2 *
46.90 1 *
47.00 2 *
47.10 4 *
47.20 9 *
47.30 10 *
47.40 15 *
47.50 25 **
47.60 21 *
47.70 28 **
47.80 40 **
47.90 55 ***
48.00 60 ***
48.10 98 *****
48.20 117 ******
48.30 131 *******
48.40 182 *********
48.50 238 ************
48.60 298 ***************
48.70 340 *****************
48.80 404 ********************
48.90 521 *************************
49.00 617 ******************************
49.10 706 **********************************
49.20 833 ****************************************
49.30 892 *******************************************
49.40 1056 ***************************************************
49.50 1134 ******************************************************
49.60 1176 ********************************************************
49.70 1171 ********************************************************
49.80 1250 ************************************************************
49.90 1187 *********************************************************
50.00 1095 *****************************************************
50.10 1009 *************************************************
50.20 951 **********************************************
50.30 875 ******************************************
50.40 708 **********************************
50.50 600 *****************************
50.60 514 *************************
50.70 358 ******************
50.80 317 ****************
50.90 260 *************
51.00 188 *********
51.10 163 ********
51.20 89 *****
51.30 64 ****
51.40 52 ***
51.50 41 **
51.60 21 *
51.70 21 *
51.80 21 *
51.90 7 *
52.00 6 *
52.10 3 *
52.20 2 *
52.30 1 *
52.40 2 *
52.50 2 *
52.60 1 *
52.70 0
52.80 0
52.90 1 *
53.00 0
Spam distribution for all runs:
14000 items; mean 50.37; sample sdev 0.81
* = 14 items
46.90 0
47.00 8 *
47.10 3 *
47.20 4 *
47.30 0
47.40 1 *
47.50 2 *
47.60 8 *
47.70 7 *
47.80 10 *
47.90 13 *
48.00 18 **
48.10 21 **
48.20 21 **
48.30 30 ***
48.40 56 ****
48.50 92 *******
48.60 54 ****
48.70 66 *****
48.80 126 *********
48.90 146 ***********
49.00 196 **************
49.10 194 **************
49.20 212 ****************
49.30 296 **********************
49.40 311 ***********************
49.50 376 ***************************
49.60 412 ******************************
49.70 464 **********************************
49.80 552 ****************************************
49.90 646 ***********************************************
50.00 698 **************************************************
50.10 780 ********************************************************
50.20 736 *****************************************************
50.30 788 *********************************************************
50.40 684 *************************************************
50.50 672 ************************************************
50.60 629 *********************************************
50.70 557 ****************************************
50.80 543 ***************************************
50.90 528 **************************************
51.00 449 *********************************
51.10 470 **********************************
51.20 437 ********************************
51.30 343 *************************
51.40 253 *******************
51.50 254 *******************
51.60 194 **************
51.70 121 *********
51.80 125 *********
51.90 81 ******
52.00 69 *****
52.10 68 *****
52.20 43 ****
52.30 32 ***
52.40 21 **
52.50 15 **
52.60 19 **
52.70 18 **
52.80 9 *
52.90 2 *
53.00 14 *
53.10 3 *
53.20 0
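As a rough measure of how weak that separation is, the quoted means and
sdevs put the gap between the two distributions at well under one pooled
standard deviation. This back-of-envelope effect-size check is mine, not
the test driver's; the numbers are copied from the histogram summaries
above:

```python
# Standardized separation of the ham and spam score distributions,
# using the means/sdevs printed by the driver above.
ham_mean, ham_sd, n_ham = 49.76, 0.70, 20000
spam_mean, spam_sd, n_spam = 50.37, 0.81, 14000

gap = spam_mean - ham_mean     # about 0.61 score points
pooled_sd = ((ham_sd**2 * (n_ham - 1) + spam_sd**2 * (n_spam - 1))
             / (n_ham + n_spam - 2)) ** 0.5

print(round(gap, 2), round(gap / pooled_sd, 2))   # 0.61 0.82
```

A gap of ~0.8 pooled sdevs means the two score distributions overlap
heavily, which is exactly what the histograms show.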
To minimize the total # of misclassifications, spam_cutoff should be raised
to 0.505. Then we would lose 4638 fp:
[slice of ham]
50.00 1095
50.10 1009
50.20 951
50.30 875
50.40 708
and gain 3686 fn:
[slice of spam]
50.00 698
50.10 780
50.20 736
50.30 788
50.40 684
Going beyond that would start raising the total number of errors again,
because the # of spam in the .505 bucket is larger than the # of ham:
[slice of ham]
50.50 600
[slice of spam]
50.50 672
So, at 0.505, we'd see
total unique false pos 7372-4638 = 2734
total unique false neg 4345+3686 = 8031
average fp % 2734/20000 = 13.67%
average fn % 8031/14000 = 57.36%
However, at that point, rating spam is worse than flipping a coin!
Seriously, if it is in fact the case that reducing the f-p rate wrt the
bland words requires making the f-n rate wrt them worse than flipping a
coin, it's small surprise that we see a reduction in the f-n rate when
leaving the bland words out.
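For anyone who wants to replay the 0.505 arithmetic, it falls straight out
of the quoted slices (a quick sketch; the error totals and bucket counts
are copied from the driver output above):

```python
# Replaying the cutoff-shift arithmetic from the quoted histogram slices.
ham_slice  = [1095, 1009, 951, 875, 708]   # ham buckets 50.00 .. 50.40
spam_slice = [698, 780, 736, 788, 684]     # spam buckets 50.00 .. 50.40

fp = 7372 - sum(ham_slice)    # f-p lost by raising spam_cutoff to 0.505
fn = 4345 + sum(spam_slice)   # f-n gained by the same move

print(fp, fn)                            # 2734 8031
print(round(fp / 20000 * 100, 2), '%')   # 13.67 %
print(round(fn / 14000 * 100, 2), '%')   # 57.36 %
```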