The thing about the geometric mean is that it is much more sensitive to numbers near 0, so the S/(S+H) technique is biased in that way. If you want to try something like that, I would suggest using the ARITHMETIC means in computing S and H and again using S/(S+H). That would remove that bias. It wouldn't be invoking that optimality theorem, but whatever works... It really seems, as a matter of being educated, that the arithmetic approach is worth trying if it doesn't take a lot of trouble to try it.
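For concreteness, here is a minimal sketch of the arithmetic variant. The function name `arithmetic_combine` and the 0.5 fallback for an empty clue list are assumptions for illustration, not anything from the SpamBayes code.

```python
def arithmetic_combine(probs):
    """Score a message as S/(S+H) using ARITHMETIC means (illustrative only)."""
    if not probs:
        return 0.5  # assumption: no clues at all means "can't tell"
    n = len(probs)
    S = sum(probs) / n                    # arithmetic mean of the spamprobs
    H = sum(1.0 - p for p in probs) / n   # arithmetic mean of 1 - spamprob
    return S / (S + H)
```

One thing the sketch makes visible: with arithmetic means S + H is always 1, so the score is just the plain average of the spamprobs. For 10 probs at .99 and 40 at exactly .5 it comes out to 0.598.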
"but more sensitive to overwhelming amounts of evidence than Gary-combining"
From the email you sent at 1:02PM yesterday:
    0.40    0
    0.45    2 *
    0.50  412 *********
    0.55 3068 *************************************************************
    0.60 1447 *****************************
    0.65   71 **
    0.70    0

One thing I'd like to be more clear on. If I understand the experiment correctly you set 10 to .99 and 40 were random. What percentage actually ended up as > .5, without regard to HOW MUCH over .5?
It's hard to know what to make of this, especially in light of the claim that Gary-combining has been proven to be the most sensitive possible test for rejecting the hypothesis that a collection of probs is uniformly distributed.
It's not the (S-H)/(S+H) that is the most sensitive (under certain conditions), it's that the geometric mean approach for computing S gives a result that is MONOTONIC WITH a calculation which is the most sensitive. The real technique would take S and feed it into an inverse chi-square function with (in this experiment) 100 degrees of freedom. The output (roughly speaking) would be the probability that that S (or a more extreme one) might have occurred by chance alone. Call these numbers S' and H' for S and H respectively.

The calculation (S-H)/(S+H) will be > 0 if and only if (S'-H')/(S'+H') is > 0 (unless I've made some error). So, as a binary indicator, the two are equivalent. However, if you used S' and H', you would see something more like real probabilities that would probably be of magnitudes that would be more attractive to you. You could probably use a table to approximate the inverse chi-square calc rather than actually doing the computations all the time.

I didn't suggest doing that at first because I was interested in providing a binary indicator and wanted to keep things simple, and from the POV of a binary indicator it doesn't make any difference. I also didn't know the spread would be of such interest. So, if it happens that you feel like taking the time to go "all the way" with this approach, I would suggest actually computing S' and H' and seeing what happens. I think you would like the results better.

I think this would work better than the S/(S+H) approach, because if you use geometric means, it's more sensitive to one condition than the other, and if you use arithmetic means, you don't invoke the optimality theorem. Of course, this is ALL speculative. But the probabilities involved will DEFINITELY be of greater magnitude, and so give a better-defined spread, if the inverse chi-square is used.
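To make the S' and H' idea concrete, here is a rough sketch of one way it could be computed. It assumes the standard Fisher result that -2 times the sum of the logs of n uniform probs is chi-square distributed with 2n degrees of freedom (so 50 probs give the 100 degrees of freedom mentioned above). The names `chi2Q` and `chi_scores`, and the choice to report 1 minus the tail probability so that a larger S' means "more spammy", are assumptions for illustration.

```python
import math

def chi2Q(x2, df):
    """Probability that a chi-square variate with EVEN df exceeds x2."""
    assert df % 2 == 0, "this closed form only works for even df"
    m = x2 / 2.0
    term = math.exp(-m)   # can underflow to 0.0 for very large x2
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1

def chi_scores(probs):
    """Return (S', H') for spamprobs strictly between 0 and 1."""
    n = len(probs)
    # Many probs near 1 make the product of (1 - p) tiny, so x2 is large,
    # the tail probability is small, and S' is close to 1.
    S_prime = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Symmetric calculation for hamminess, using the probs themselves.
    H_prime = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return S_prime, H_prime
```

With 10 probs at .99 and 40 at .5, S' comes out far larger than H', which is the "greater magnitude" effect described above; S' and H' can then be fed to (S'-H')/(S'+H') in place of S and H.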
--Gary

-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
From: Tim Peters <tim.one@comcast.net>
Date: Wed, 09 Oct 2002 20:34:15 -0400
To: SpamBayes <spambayes@python.org>
Cc: Gary Robinson <grobinson@transpose.com>
Subject: RE: [Spambayes] spamprob combining
[Tim]
... Intuitively, it *seems* like it would be good to get something not so insanely sensitive to random input as Paul-combining, but more sensitive to overwhelming amounts of evidence than Gary-combining.
So there's a new option,
    [Classifier]
    use_tim_combining: True
The comments (from Options.py) explain it:
    # For the default scheme, use "tim-combining" of probabilities.  This
    # has no effect under the central-limit schemes.  Tim-combining is a
    # kind of cross between Paul Graham's and Gary Robinson's combining
    # schemes.  Unlike Paul's, it's never crazy-certain, and compared to
    # Gary's, in Tim's tests it greatly increased the spread between mean
    # ham-scores and spam-scores, while simultaneously decreasing the
    # variance of both.  Tim needed a higher spam_cutoff value for best
    # results, but spam_cutoff is less touchy than under Gary-combining.
    use_tim_combining: False
"Tim combining" simply takes the geometric mean of the spamprobs as a measure of spamminess S, and the geometric mean of 1-spamprob as a measure of hamminess H, then returns S/(S+H) as "the score". This is well-behaved when fed random, uniformly distributed probabilities, but isn't reluctant to let an overwhelming number of extreme clues lead it to an extreme conclusion (although you're not going to see it give Graham-like 1e-30 or 1.0000000000000 scores).
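A minimal sketch of the calculation just described, not the actual classifier code. The name `tim_combine` is made up here, the probs are assumed to lie strictly between 0 and 1, and the 0.5 returned for an empty list is likewise an assumption.

```python
import math

def tim_combine(probs):
    """Geometric-mean combining: S/(S+H), for probs strictly in (0, 1)."""
    if not probs:
        return 0.5  # assumption: no clues means "can't tell"
    n = len(probs)
    # Geometric means computed via logs so that a long product of small
    # numbers doesn't underflow.
    S = math.exp(sum(math.log(p) for p in probs) / n)        # spamminess
    H = math.exp(sum(math.log(1.0 - p) for p in probs) / n)  # hamminess
    return S / (S + H)
```

A list of probs all equal to 0.5 scores 0.5, and a list all equal to 0.9 scores 0.9, so on uniform clue strength the score just echoes the clues.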
Don't use a central-limit scheme with this (it has no effect on those). If you test it, use whatever variations on the "all default" scheme you usually use, but it will probably help to boost spam_cutoff. Note that the default max_discriminators is still 150, and that's what I used below.
Here's a 10-set cross-validation run on my data, restricted to 100 ham and 100 spam per set, with all defaults, except
                         before  after
                         ------  -----
    use_tim_combining    False   True
    spam_cutoff          0.55    0.615
-> <stat> tested 100 hams & 100 spams against 900 hams & 900 spams [ditto 19 times]
false positive percentages
    0.000  0.000  tied
    1.000  0.000  won   -100.00%
    1.000  1.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   1 times
tied  9 times
lost  0 times

total unique fp went from 2 to 1 won    -50.00%
mean fp % went from 0.2 to 0.1 won    -50.00%

false negative percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    1.000  1.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 1 to 1 tied
mean fn % went from 0.1 to 0.1 tied
The real story here is in the score distributions; contrary to what the comment said above, the ham-score variance increased with this little data:
ham mean                     ham sdev
  30.63   18.80  -38.62%        6.03    6.83  +13.27%
  29.31   17.35  -40.81%        5.48    6.84  +24.82%
  29.96   18.50  -38.25%        6.95    9.02  +29.78%
  29.66   18.12  -38.91%        5.89    6.81  +15.62%
  29.51   17.34  -41.24%        5.73    6.71  +17.10%
  29.40   17.43  -40.71%        5.73    6.61  +15.36%
  29.75   17.74  -40.37%        5.76    6.96  +20.83%
  29.71   18.17  -38.84%        5.97    6.48   +8.54%
  31.98   20.41  -36.18%        5.96    8.02  +34.56%
  29.83   18.11  -39.29%        4.75    5.41  +13.89%

ham mean and sdev for all runs
  29.97   18.20  -39.27%        5.90    7.08  +20.00%

spam mean                    spam sdev
  79.23   88.38  +11.55%        6.96    5.52  -20.69%
  79.40   88.70  +11.71%        7.00    5.64  -19.43%
  78.68   88.06  +11.92%        6.69    5.13  -23.32%
  79.65   89.01  +11.75%        7.20    5.22  -27.50%
  79.91   88.87  +11.21%        6.35    4.67  -26.46%
  80.47   89.16  +10.80%        7.22    6.06  -16.07%
  80.94   89.78  +10.92%        6.60    4.45  -32.58%
  80.30   89.41  +11.34%        6.95    5.49  -21.01%
  78.54   87.70  +11.66%        7.30    6.45  -11.64%
  80.06   89.06  +11.24%        6.98    5.43  -22.21%

spam mean and sdev for all runs
  79.72   88.81  +11.40%        6.97    5.47  -21.52%

ham/spam mean difference: 49.75 70.61 +20.86
So before, the score equidistant from both means was 52.78, at 3.87 sdevs from each; after, it was 58.03, at 5.63 sdevs from each. The populations are much better separated by this measure.
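In case the "equidistant" figure isn't obvious: it's the score x that sits the same number k of standard deviations above the ham mean as below the spam mean, each measured in its own population's sdev. A small sketch (the function name `equidistant` is just for illustration):

```python
def equidistant(ham_mean, ham_sdev, spam_mean, spam_sdev):
    """Return (x, k) where x = ham_mean + k*ham_sdev = spam_mean - k*spam_sdev."""
    # Solving the two equations for k gives the gap between the means
    # divided by the sum of the two sdevs.
    k = (spam_mean - ham_mean) / (ham_sdev + spam_sdev)
    x = ham_mean + k * ham_sdev
    return x, k

before = equidistant(29.97, 5.90, 79.72, 6.97)
after = equidistant(18.20, 7.08, 88.81, 5.47)
print("before: %.2f at %.2f sdevs" % before)
print("after:  %.2f at %.2f sdevs" % after)
```

Fed the all-runs means and sdevs, this reproduces the numbers quoted: 52.78 at 3.87 sdevs before, and 58.03 at 5.63 sdevs after.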
Histograms before:
-> <stat> Ham scores for all runs: 1000 items; mean 29.97; sdev 5.90 -> <stat> min 13.521; median 29.6919; max 60.8937 * = 2 items ... 13 2 * 14 0 15 2 * 16 8 **** 17 4 ** 18 9 ***** 19 17 ********* 20 14 ******* 21 16 ******** 22 24 ************ 23 38 ******************* 24 47 ************************ 25 62 ******************************* 26 65 ********************************* 27 69 *********************************** 28 73 ************************************* 29 70 *********************************** 30 76 ************************************** 31 70 *********************************** 32 61 ******************************* 33 51 ************************** 34 50 ************************* 35 34 ***************** 36 30 *************** 37 27 ************** 38 18 ********* 39 12 ****** 40 11 ****** 41 13 ******* 42 2 * 43 5 *** 44 8 **** 45 2 * 46 1 * 47 3 ** 48 1 * 49 0 50 3 ** 51 0 52 0 53 0 54 0 55 1 * 56 0 57 0 58 0 59 0 60 1 * ...
-> <stat> Spam scores for all runs: 1000 items; mean 79.72; sdev 6.97 -> <stat> min 52.3428; median 79.9799; max 98.1879 * = 2 items ... 52 1 * 53 0 54 0 55 0 56 3 ** 57 1 * 58 0 59 1 * 60 4 ** 61 4 ** 62 4 ** 63 3 ** 64 4 ** 65 7 **** 66 9 ***** 67 10 ***** 68 13 ******* 69 16 ******** 70 26 ************* 71 18 ********* 72 29 *************** 73 35 ****************** 74 40 ******************** 75 39 ******************** 76 56 **************************** 77 52 ************************** 78 50 ************************* 79 76 ************************************** 80 60 ****************************** 81 77 *************************************** 82 45 *********************** 83 61 ******************************* 84 50 ************************* 85 43 ********************** 86 41 ********************* 87 33 ***************** 88 19 ********** 89 11 ****** 90 11 ****** 91 8 **** 92 2 * 93 9 ***** 94 4 ** 95 9 ***** 96 2 * 97 11 ****** 98 3 ** 99 0
Histograms after:
-> <stat> Ham scores for all runs: 1000 items; mean 18.20; sdev 7.08 -> <stat> min 5.6946; median 17.1757; max 73.1302 * = 2 items ... 5 1 * 6 13 ******* 7 16 ******** 8 25 ************* 9 22 *********** 10 37 ******************* 11 45 *********************** 12 56 **************************** 13 70 *********************************** 14 61 ******************************* 15 66 ********************************* 16 79 **************************************** 17 63 ******************************** 18 59 ****************************** 19 59 ****************************** 20 56 **************************** 21 47 ************************ 22 36 ****************** 23 37 ******************* 24 32 **************** 25 9 ***** 26 20 ********** 27 17 ********* 28 8 **** 29 7 **** 30 11 ****** 31 6 *** 32 7 **** 33 5 *** 34 4 ** 35 2 * 36 2 * 37 6 *** 38 1 * 39 0 40 3 ** 41 3 ** 42 0 43 1 * 44 1 * 45 1 * 46 0 47 1 * 48 0 49 0 50 2 * 51 1 * 52 0 53 0 54 0 55 0 56 0 57 0 58 0 59 0 60 0 61 1 * 62 0 63 0 64 0 65 0 66 0 67 0 68 0 69 0 70 0 71 0 72 0 73 1 *
-> <stat> Spam scores for all runs: 1000 items; mean 88.81; sdev 5.47 -> <stat> min 54.9382; median 89.5188; max 98.3805 * = 2 items ... 54 1 * 55 0 56 0 57 0 58 0 59 0 60 0 61 0 62 0 63 1 * 64 3 ** 65 0 66 1 * 67 0 68 2 * 69 2 * 70 3 ** 71 3 ** 72 2 * 73 2 * 74 4 ** 75 4 ** 76 6 *** 77 8 **** 78 8 **** 79 6 *** 80 12 ****** 81 25 ************* 82 26 ************* 83 25 ************* 84 39 ******************** 85 58 ***************************** 86 70 *********************************** 87 64 ******************************** 88 74 ************************************* 89 106 ***************************************************** 90 85 ******************************************* 91 62 ******************************* 92 86 ******************************************* 93 79 **************************************** 94 37 ******************* 95 23 ************ 96 42 ********************* 97 25 ************* 98 6 *** 99 0
There are snaky tails in either case, but "the middle ground" here is larger, sparser, and still contains the errors.
Across my full test data, which I actually ran first, you can ignore the "won/lost" business; I had spam_cutoff at 0.55 for both runs, and the overall results would have been virtually identical had I boosted spam_cutoff in the second run (recall that I can't demonstrate an improvement on this data anymore! I can only determine whether something is a disaster, and this ain't).
-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto 19 times]
...
false positive percentages
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.050  lost  +(was 0)
    0.000  0.000  tied
    0.050  0.100  lost  +100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied

won   0 times
tied  6 times
lost  4 times

total unique fp went from 2 to 6 lost  +200.00%
mean fp % went from 0.01 to 0.03 lost  +200.00%

false negative percentages
    0.000  0.000  tied
    0.071  0.071  tied
    0.000  0.000  tied
    0.071  0.071  tied
    0.143  0.071  won    -50.35%
    0.143  0.000  won   -100.00%
    0.143  0.143  tied
    0.143  0.000  won   -100.00%
    0.071  0.000  won   -100.00%
    0.000  0.000  tied

won   4 times
tied  6 times
lost  0 times

total unique fn went from 11 to 5 won    -54.55%
mean fn % went from 0.0785714285714 to 0.0357142857143 won    -54.55%
ham mean                     ham sdev
  25.65   10.68  -58.36%        5.67    5.44   -4.06%
  25.61   10.68  -58.30%        5.50    5.29   -3.82%
  25.57   10.68  -58.23%        5.67    5.49   -3.17%
  25.66   10.71  -58.26%        5.54    5.27   -4.87%
  25.42   10.55  -58.50%        5.72    5.71   -0.17%
  25.51   10.43  -59.11%        5.39    5.11   -5.19%
  25.65   10.40  -59.45%        5.59    5.29   -5.37%
  25.61   10.51  -58.96%        5.41    5.21   -3.70%
  25.84   10.80  -58.20%        5.48    5.30   -3.28%
  25.81   10.85  -57.96%        5.81    5.73   -1.38%

ham mean and sdev for all runs
  25.63   10.63  -58.53%        5.58    5.39   -3.41%

spam mean                    spam sdev
  83.86   93.17  +11.10%        7.09    4.55  -35.83%
  83.64   93.16  +11.38%        6.83    4.52  -33.82%
  83.27   92.91  +11.58%        6.81    4.52  -33.63%
  83.82   93.14  +11.12%        6.88    4.67  -32.12%
  83.89   93.29  +11.21%        6.65    4.56  -31.43%
  83.78   93.11  +11.14%        6.96    4.72  -32.18%
  83.42   93.00  +11.48%        6.82    4.74  -30.50%
  83.86   93.29  +11.24%        6.71    4.55  -32.19%
  83.88   93.22  +11.13%        6.98    4.71  -32.52%
  83.75   93.28  +11.38%        6.65    4.32  -35.04%

spam mean and sdev for all runs
  83.72   93.16  +11.28%        6.84    4.59  -32.89%

ham/spam mean difference: 58.09 82.53 +24.44
So the equidistant score changed from 51.73 at 4.68 sdevs from each mean, to 55.20 at 8.27 sdevs from each. That's big.
The "after" histograms had 200 buckets in this run:
-> <stat> Ham scores for all runs: 20000 items; mean 10.63; sdev 5.39 -> <stat> min 0.281945; median 9.69929; max 81.9673 * = 17 items 0.0 7 * 0.5 13 * 1.0 21 ** 1.5 41 *** 2.0 86 ****** 2.5 166 ********** 3.0 239 *************** 3.5 326 ******************** 4.0 466 **************************** 4.5 554 ********************************* 5.0 642 ************************************** 5.5 701 ****************************************** 6.0 793 *********************************************** 6.5 804 ************************************************ 7.0 933 ******************************************************* 7.5 972 ********************************************************** 8.0 997 *********************************************************** 8.5 934 ******************************************************* 9.0 947 ******************************************************** 9.5 939 ******************************************************** 10.0 839 ************************************************** 10.5 786 *********************************************** 11.0 752 ********************************************* 11.5 760 ********************************************* 12.0 636 ************************************** 12.5 606 ************************************ 13.0 554 ********************************* 13.5 483 ***************************** 14.0 461 **************************** 14.5 399 ************************ 15.0 360 ********************** 15.5 317 ******************* 16.0 275 ***************** 16.5 224 ************** 17.0 193 ************ 17.5 169 ********** 18.0 172 *********** 18.5 154 ********** 19.0 153 ********* 19.5 92 ****** 20.0 104 ******* 20.5 99 ****** 21.0 74 ***** 21.5 73 ***** 22.0 73 ***** 22.5 50 *** 23.0 38 *** 23.5 50 *** 24.0 38 *** 24.5 34 ** 25.0 26 ** 25.5 39 *** 26.0 24 ** 26.5 34 ** 27.0 18 ** 27.5 15 * 28.0 20 ** 28.5 15 * 29.0 14 * 29.5 15 * 30.0 12 * 30.5 15 * 31.0 14 * 31.5 10 * 32.0 12 * 32.5 6 * 33.0 10 * 33.5 4 * 34.0 8 * 34.5 5 * 35.0 5 * 35.5 6 * 
36.0 7 * 36.5 4 * 37.0 2 * 37.5 3 * 38.0 1 * 38.5 4 * 39.0 6 * 39.5 2 * 40.0 2 * 40.5 5 * 41.0 0 41.5 2 * 42.0 3 * 42.5 3 * 43.0 1 * 43.5 2 * 44.0 1 * 44.5 2 * 45.0 1 * 45.5 1 * 46.0 2 * 46.5 0 47.0 3 * 47.5 0 48.0 1 * 48.5 1 * 49.0 1 * 49.5 0 50.0 1 * 50.5 0 51.0 2 * 51.5 0 52.0 1 * 52.5 0 53.0 0 53.5 1 * 54.0 1 * 54.5 2 * 55.0 0 55.5 0 56.0 1 * 56.5 1 * 57.0 0 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 0 60.5 0 61.0 1 * 61.5 0 62.0 0 62.5 0 63.0 0 63.5 0 64.0 0 64.5 0 65.0 0 65.5 0 66.0 0 66.5 0 67.0 0 67.5 0 68.0 0 68.5 0 69.0 0 69.5 0 70.0 1 * the lady with the long & obnoxious employer-generated sig 70.5 0 71.0 0 71.5 0 72.0 0 72.5 0 73.0 0 73.5 0 74.0 0 74.5 0 75.0 0 75.5 0 76.0 0 76.5 0 77.0 0 77.5 0 78.0 0 78.5 0 79.0 0 79.5 0 80.0 0 80.5 0 81.0 0 81.5 1 * the verbatim quote of a long Nigerian-scam spam ...
-> <stat> Spam scores for all runs: 14000 items; mean 93.16; sdev 4.59 -> <stat> min 24.3497; median 93.8141; max 99.6769 * = 15 items ... 24.0 1 * not really sure -- it's a giant base64-encoded plain text file 24.5 0 25.0 0 25.5 0 26.0 0 26.5 0 27.0 0 27.5 0 28.0 0 28.5 0 29.0 1 * the spam with the uuencoded body we throw away 29.5 0 30.0 0 30.5 0 31.0 0 31.5 0 32.0 0 32.5 0 33.0 0 33.5 0 34.0 0 34.5 0 35.0 0 35.5 0 36.0 0 36.5 0 37.0 0 37.5 0 38.0 0 38.5 0 39.0 0 39.5 0 40.0 0 40.5 0 41.0 0 41.5 0 42.0 0 42.5 0 43.0 0 43.5 0 44.0 0 44.5 0 45.0 0 45.5 0 46.0 1 * Hello, my Name is BlackIntrepid 46.5 0 47.0 0 47.5 0 48.0 0 48.5 0 49.0 0 49.5 0 50.0 0 50.5 0 51.0 0 51.5 0 52.0 0 52.5 0 53.0 0 53.5 1 * unclear; a collection of webmaster links 54.0 1 * Susan makes a propsal (sic) to Tim 54.5 0 55.0 1 * 55.5 0 56.0 0 56.5 1 * 57.0 2 * 57.5 0 58.0 0 58.5 1 * 59.0 0 59.5 0 60.0 1 * 60.5 2 * 61.0 1 * 61.5 1 * 62.0 0 62.5 1 * 63.0 1 * 63.5 0 64.0 1 * 64.5 1 * 65.0 0 65.5 1 * 66.0 1 * 66.5 2 * 67.0 4 * 67.5 2 * 68.0 0 68.5 1 * 69.0 0 69.5 3 * 70.0 1 * 70.5 5 * 71.0 5 * 71.5 3 * 72.0 4 * 72.5 3 * 73.0 3 * 73.5 6 * 74.0 3 * 74.5 4 * 75.0 8 * 75.5 8 * 76.0 10 * 76.5 10 * 77.0 10 * 77.5 17 ** 78.0 14 * 78.5 27 ** 79.0 16 ** 79.5 23 ** 80.0 28 ** 80.5 29 ** 81.0 37 *** 81.5 37 *** 82.0 46 **** 82.5 55 **** 83.0 47 **** 83.5 53 **** 84.0 58 **** 84.5 68 ***** 85.0 86 ****** 85.5 118 ******** 86.0 135 ********* 86.5 159 *********** 87.0 165 *********** 87.5 178 ************ 88.0 209 ************** 88.5 231 **************** 89.0 299 ******************** 89.5 391 *************************** 90.0 425 ***************************** 90.5 402 *************************** 91.0 501 ********************************** 91.5 582 *************************************** 92.0 636 ******************************************* 92.5 667 ********************************************* 93.0 713 ************************************************ 93.5 685 ********************************************** 94.0 610 
***************************************** 94.5 621 ****************************************** 95.0 721 ************************************************* 95.5 735 ************************************************* 96.0 870 ********************************************************** 96.5 742 ************************************************** 97.0 449 ****************************** 97.5 447 ****************************** 98.0 556 ************************************** 98.5 561 ************************************** 99.0 264 ****************** 99.5 171 ************
The mistakes are all familiar; the good news is that "the normal cases" are far removed from what might plausibly be called a middle ground. For example, if we called the region from 40 thru 70 here "the middle ground", and kicked those out for manual review, there would be very few msgs to review, but they would contain almost all the mistakes.
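The "kick the middle ground out for manual review" idea amounts to a three-way split on the score. A tiny sketch, using the 40 and 70 from the example above on the same 0-100 scale as the histograms; the function name `triage` and the label strings are made up for illustration and don't correspond to any existing option.

```python
def triage(score, ham_cutoff=40.0, spam_cutoff=70.0):
    """Three-way split of a 0-100 score; cutoffs are the example's 40 and 70."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"  # 40 thru 70 inclusive goes to manual review
```

On the full-data "after" run the mean ham score (10.63) and the mean spam score (93.16) both fall well outside the review band, while the equidistant score of 55.20 lands inside it.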
How does this do on your data? I'm in favor of what works <wink>.