[Spambayes] Proposing to remove 4 combining schemes

Tim Peters tim.one@comcast.net
Thu Oct 17 20:25:38 2002


[Rob W. W. Hooft, dividing S and H and n by various things before
 computing chi2Q]

> Normal:
> -> best cost for all runs: $102.60
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.495 & 0.96
> ->     fp 3; fn 14; unsure ham 40; unsure spam 253
> ->     fp rate 0.0187%; fn rate 0.241%; unsure rate 1.34%


> Dividing the log-products and n by 2:
> -> best cost for all runs: $104.40
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at ham & spam cutoffs 0.49 & 0.92
> ->     fp 3; fn 14; unsure ham 43; unsure spam 259
> ->     fp rate 0.0187%; fn rate 0.241%; unsure rate 1.39%

> Dividing the log-products and n by 4:
> -> best cost for all runs: $108.20
> -> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.48 & 0.855
> ->     fp 4; fn 13; unsure ham 46; unsure spam 230
> ->     fp rate 0.025%; fn rate 0.224%; unsure rate 1.27%
> -> largest ham & spam cutoffs 0.485 & 0.855
> ->     fp 4; fn 14; unsure ham 42; unsure spam 229
> ->     fp rate 0.025%; fn rate 0.241%; unsure rate 1.24%

> As I expected, this significantly broadens the extremes at only very
> little cost.

But what's the point?  By your own cost measure, it didn't do you any good,
and in fact it raised your FP rate by the time you got to dividing by 4.


> What this does statistically is downweight all clues, thereby taking
> care of a "standard" correlation between clues.  This may be
> functionally equivalent to raising the value of s.

I doubt the latter, but if it's true I'd much rather get there by raising s,
which is symmetric and comprehensible.  Fudging H, S and n introduces
strange biases, because the info you're feeding into chi2Q no longer follows
a chi-squared distribution after fudging, and chi2Q may as well be some form
of biased random-number generator then.

> This is the /4 code for reference:
>
> Index: classifier.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
> retrieving revision 1.38
> diff -u -r1.38 classifier.py
> --- classifier.py	14 Oct 2002 02:20:35 -0000	1.38
> +++ classifier.py	17 Oct 2002 15:24:55 -0000
> @@ -516,7 +516,10 @@
>           S = ln(S) + Sexp * LN2
>           H = ln(H) + Hexp * LN2
>
> -        n = len(clues)
> +        S = S/4.0
> +        H = H/4.0
> +
> +        n = len(clues)//4
>           if n:
>               S = 1.0 - chi2Q(-2.0 * S, 2*n)
>               H = 1.0 - chi2Q(-2.0 * H, 2*n)

Fiddle chi2.judge() to play with this; a rough standalone sketch also
follows the first table below.  Here's the straight H distribution
(S is similar) on vectors of 52 random probs:

52 random probs
H  10000 items; mean 0.50; sdev 0.29
-> <stat> min 0.000119708; median 0.500356; max 0.999988
* = 9 items
0.00 498 ********************************************************
0.05 494 *******************************************************
0.10 504 ********************************************************
0.15 546 *************************************************************
0.20 484 ******************************************************
0.25 470 *****************************************************
0.30 494 *******************************************************
0.35 491 *******************************************************
0.40 505 *********************************************************
0.45 513 *********************************************************
0.50 504 ********************************************************
0.55 474 *****************************************************
0.60 500 ********************************************************
0.65 502 ********************************************************
0.70 501 ********************************************************
0.75 542 *************************************************************
0.80 517 **********************************************************
0.85 443 **************************************************
0.90 514 **********************************************************
0.95 504 ********************************************************
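
If you don't want to dig judge() out of chi2.py, here's roughly the kind
of loop that produces these tables.  This is a standalone sketch rather
than the actual judge() code, and it assumes chi2.py is importable:

    from math import log
    from random import random
    from chi2 import chi2Q

    def hscore(nprobs, div=1):
        # H the way classifier.py computes it, with the optional /div fudge.
        logsum = 0.0
        for i in range(nprobs):
            logsum += log(random())
        return 1.0 - chi2Q(-2.0 * logsum / div, 2 * (nprobs // div))

    nbuckets = 20
    counts = [0] * nbuckets
    for trial in range(10000):
        h = hscore(52)               # hscore(52, div=4) gives the fudged run
        counts[min(int(h * nbuckets), nbuckets - 1)] += 1

    scale = max(counts) // 60 + 1    # keep the longest bar ~60 chars wide
    for i in range(nbuckets):
        print("%4.2f %5d %s" % (i / float(nbuckets),
                                counts[i], "*" * (counts[i] // scale)))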

Do the same but divide everything by 4 first (as you showed), and H is no
longer uniformly distributed:

52 random probs
H/4 & n//4  10000 items; mean 0.52; sdev 0.18
-> <stat> min 0.0144875; median 0.527973; max 0.973816
* = 17 items
0.00    4 *
0.05   47 ***
0.10  116 *******
0.15  238 **************
0.20  303 ******************
0.25  498 ******************************
0.30  631 **************************************
0.35  781 **********************************************
0.40  900 *****************************************************
0.45  933 *******************************************************
0.50  967 *********************************************************
0.55 1017 ************************************************************
0.60  893 *****************************************************
0.65  812 ************************************************
0.70  699 ******************************************
0.75  519 *******************************
0.80  339 ********************
0.85  208 *************
0.90   87 ******
0.95    8 *

The bias also shifts according to the number of extreme words in a msg
modulo 4, getting more lopsided the larger n%4:

53 random probs
H/4 & n//4  10000 items; mean 0.55; sdev 0.18
-> <stat> min 0.030539; median 0.554048; max 0.975847
* = 17 items
0.00    3 *
0.05   24 **
0.10   74 *****
0.15  133 ********
0.20  261 ****************
0.25  420 *************************
0.30  558 *********************************
0.35  706 ******************************************
0.40  822 *************************************************
0.45  936 ********************************************************
0.50  995 ***********************************************************
0.55 1007 ************************************************************
0.60  989 ***********************************************************
0.65  866 ***************************************************
0.70  804 ************************************************
0.75  642 **************************************
0.80  396 ************************
0.85  247 ***************
0.90  106 *******
0.95   11 *


54 random probs
H/4 & n//4  10000 items; mean 0.57; sdev 0.17
-> <stat> min 0.0562266; median 0.579539; max 0.984772
* = 17 items
0.00    0
0.05   14 *
0.10   47 ***
0.15   97 ******
0.20  201 ************
0.25  327 ********************
0.30  478 *****************************
0.35  643 **************************************
0.40  744 ********************************************
0.45  868 ****************************************************
0.50  981 **********************************************************
0.55 1020 ************************************************************
0.60 1004 ************************************************************
0.65  968 *********************************************************
0.70  894 *****************************************************
0.75  750 *********************************************
0.80  532 ********************************
0.85  298 ******************
0.90  112 *******
0.95   22 **


55 random probs
H/4 & n//4  10000 items; mean 0.60; sdev 0.17
-> <stat> min 0.0477139; median 0.61042; max 0.971135
* = 19 items
0.00    1 *
0.05    7 *
0.10   26 **
0.15   84 *****
0.20  153 *********
0.25  270 ***************
0.30  359 *******************
0.35  452 ************************
0.40  659 ***********************************
0.45  819 ********************************************
0.50  919 *************************************************
0.55 1022 ******************************************************
0.60 1108 ***********************************************************
0.65 1088 **********************************************************
0.70  959 ***************************************************
0.75  792 ******************************************
0.80  661 ***********************************
0.85  412 **********************
0.90  186 **********
0.95   23 **
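
(With the sketch above, these three runs are just the same loop using
hscore(53, div=4), hscore(54, div=4) and hscore(55, div=4).)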

So, sorry, but overall this strikes me as the kind of thing we worked like
hell to get away from in Paul's scheme:  strange and inconsistent biases
that don't actually help, but at least cancel each other out when you get
lucky <wink>.  Extremity merely for the sake of extremity was no virtue, and
neither is its converse.