[Spambayes] A perverse prob strength experiment
Tim Peters
tim.one@comcast.net
Sun, 22 Sep 2002 14:05:14 -0400
What if we ignored all words with spamprob *greater* than 0.1 away from 0.5?
That is, only look at the blandest words. That's an indicator of how much
relevant information is in the bland words. Apart from that, the same as
"the usual" default untweaked Robinson setup, although I had the foresight
<wink> to ask for 1000 histogram buckets:
[Classifier]
use_robinson_probability: True
use_robinson_combining: True
max_discriminators: 1500
[TestDriver]
spam_cutoff: 0.5
nbuckets: 1000
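In code terms, the tweak just inverts the usual discriminator test: keep a
word only if its spamprob lies *within* 0.1 of 0.5. A minimal sketch (not
the actual classifier code -- the function and constant names here are made
up):

```python
# Hypothetical sketch of the "bland words only" filter described above.
BLAND_BAND = 0.1  # keep words whose spamprob is within this of 0.5

def bland_only(word_probs):
    """Filter (word, spamprob) pairs down to the bland ones."""
    return [(w, p) for (w, p) in word_probs
            if abs(p - 0.5) <= BLAND_BAND]

# Example: the strong discriminators are thrown away, the bland words kept.
probs = [('viagra', 0.99), ('the', 0.52), ('python', 0.01),
         ('meeting', 0.45)]
print(bland_only(probs))   # [('the', 0.52), ('meeting', 0.45)]
```

The surviving probabilities then go through the usual Robinson combining
step unchanged.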
There is *some* information in the bland words; with the cutoff at 0.5,
total unique false pos 7372
total unique false neg 4345
average fp % 36.86
average fn % 31.0357142857
The variance in f-p rates across runs was very high:
f-p % f-n %
38.850 29.214
32.500 33.071
33.950 34.929
38.450 30.000
44.050 26.786
38.700 29.071
30.800 34.214
36.750 31.000
36.450 32.286
38.100 29.786
If there were no exploitable information, both error rates would be near
50%. A question is whether the bits of information in the bland words vote
"in the same direction" as the non-bland words. The idea that including the
bland words should do no *harm* assumes that they tend to vote in the same
direction as the non-bland words (which we know very strongly vote in
"the right" direction). I'm not sure that they do! The mean of the spam
distribution is a little higher than the mean of the ham distribution, but
both are far below the values of spam_cutoff that have worked best overall
by all reports so far:
Ham distribution for all runs:
20000 items; mean 49.76; sample sdev 0.70
* = 21 items
45.90 0
46.00 1 *
46.10 0
46.20 0
46.30 0
46.40 0
46.50 1 *
46.60 0
46.70 3 *
46.80 2 *
46.90 1 *
47.00 2 *
47.10 4 *
47.20 9 *
47.30 10 *
47.40 15 *
47.50 25 **
47.60 21 *
47.70 28 **
47.80 40 **
47.90 55 ***
48.00 60 ***
48.10 98 *****
48.20 117 ******
48.30 131 *******
48.40 182 *********
48.50 238 ************
48.60 298 ***************
48.70 340 *****************
48.80 404 ********************
48.90 521 *************************
49.00 617 ******************************
49.10 706 **********************************
49.20 833 ****************************************
49.30 892 *******************************************
49.40 1056 ***************************************************
49.50 1134 ******************************************************
49.60 1176 ********************************************************
49.70 1171 ********************************************************
49.80 1250 ************************************************************
49.90 1187 *********************************************************
50.00 1095 *****************************************************
50.10 1009 *************************************************
50.20 951 **********************************************
50.30 875 ******************************************
50.40 708 **********************************
50.50 600 *****************************
50.60 514 *************************
50.70 358 ******************
50.80 317 ****************
50.90 260 *************
51.00 188 *********
51.10 163 ********
51.20 89 *****
51.30 64 ****
51.40 52 ***
51.50 41 **
51.60 21 *
51.70 21 *
51.80 21 *
51.90 7 *
52.00 6 *
52.10 3 *
52.20 2 *
52.30 1 *
52.40 2 *
52.50 2 *
52.60 1 *
52.70 0
52.80 0
52.90 1 *
53.00 0
Spam distribution for all runs:
14000 items; mean 50.37; sample sdev 0.81
* = 14 items
46.90 0
47.00 8 *
47.10 3 *
47.20 4 *
47.30 0
47.40 1 *
47.50 2 *
47.60 8 *
47.70 7 *
47.80 10 *
47.90 13 *
48.00 18 **
48.10 21 **
48.20 21 **
48.30 30 ***
48.40 56 ****
48.50 92 *******
48.60 54 ****
48.70 66 *****
48.80 126 *********
48.90 146 ***********
49.00 196 **************
49.10 194 **************
49.20 212 ****************
49.30 296 **********************
49.40 311 ***********************
49.50 376 ***************************
49.60 412 ******************************
49.70 464 **********************************
49.80 552 ****************************************
49.90 646 ***********************************************
50.00 698 **************************************************
50.10 780 ********************************************************
50.20 736 *****************************************************
50.30 788 *********************************************************
50.40 684 *************************************************
50.50 672 ************************************************
50.60 629 *********************************************
50.70 557 ****************************************
50.80 543 ***************************************
50.90 528 **************************************
51.00 449 *********************************
51.10 470 **********************************
51.20 437 ********************************
51.30 343 *************************
51.40 253 *******************
51.50 254 *******************
51.60 194 **************
51.70 121 *********
51.80 125 *********
51.90 81 ******
52.00 69 *****
52.10 68 *****
52.20 43 ****
52.30 32 ***
52.40 21 **
52.50 15 **
52.60 19 **
52.70 18 **
52.80 9 *
52.90 2 *
53.00 14 *
53.10 3 *
53.20 0
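As a rough measure of how weak that separation is, the quoted means and
sdevs put the gap between the two distributions at well under one pooled
standard deviation. This back-of-envelope effect-size check is mine, not
the test driver's; the numbers are copied from the histogram summaries
above:

```python
# Standardized separation of the ham and spam score distributions,
# using the means/sdevs printed by the driver above.
ham_mean, ham_sd, n_ham = 49.76, 0.70, 20000
spam_mean, spam_sd, n_spam = 50.37, 0.81, 14000

gap = spam_mean - ham_mean     # about 0.61 score points
pooled_sd = ((ham_sd**2 * (n_ham - 1) + spam_sd**2 * (n_spam - 1))
             / (n_ham + n_spam - 2)) ** 0.5

print(round(gap, 2), round(gap / pooled_sd, 2))   # 0.61 0.82
```

A gap of ~0.8 pooled sdevs means the two score distributions overlap
heavily, which is exactly what the histograms show.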
To minimize the total # of misclassifications, spam_cutoff should be raised
to 0.505. Then we would lose 4638 fp:
[slice of ham]
50.00 1095
50.10 1009
50.20 951
50.30 875
50.40 708
and gain 3686 fn:
[slice of spam]
50.00 698
50.10 780
50.20 736
50.30 788
50.40 684
Going beyond that would start raising the total number of errors again,
because the # of spam in the .505 bucket is larger than the # of ham:
[slice of ham]
50.50 600
[slice of spam]
50.50 672
So, at 0.505, we'd see
total unique false pos 7372-4638 = 2734
total unique false neg 4345+3686 = 8031
average fp % 2734/20000 = 13.67%
average fn % 8031/14000 = 57.36%
However, at that point, rating spam is worse than flipping a coin!
Seriously, if it is in fact the case that reducing the f-p rate wrt the
bland words requires making the f-n rate wrt them worse than flipping a
coin, it's small surprise that we see a reduction in the f-n rate when
leaving the bland words out.
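For anyone who wants to replay the 0.505 arithmetic, it falls straight out
of the quoted slices (a quick sketch; the error totals and bucket counts
are copied from the driver output above):

```python
# Replaying the cutoff-shift arithmetic from the quoted histogram slices.
ham_slice  = [1095, 1009, 951, 875, 708]   # ham buckets 50.00 .. 50.40
spam_slice = [698, 780, 736, 788, 684]     # spam buckets 50.00 .. 50.40

fp = 7372 - sum(ham_slice)    # f-p lost by raising spam_cutoff to 0.505
fn = 4345 + sum(spam_slice)   # f-n gained by the same move

print(fp, fn)                            # 2734 8031
print(round(fp / 20000 * 100, 2), '%')   # 13.67 %
print(round(fn / 14000 * 100, 2), '%')   # 57.36 %
```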