[Spambayes] spamprob combining

Tim Peters tim.one@comcast.net
Wed, 09 Oct 2002 23:08:03 -0400


[Gary Robinson]
> The thing about the geometric mean is that it is much more sensitive to
> numbers near 0, so the S/(S+H) technique is biased in that way.

A single geometric mean would surely be biased, but the combination of two
used here doesn't appear to be.  That is, throwing random data at it, the
mean and median are 0.5, and it's symmetric around that:

5000 items; mean 0.50; sdev 0.06
-> <stat> min 0.291521; median 0.500264; max 0.726668
* = 24 items
0.25    3 *
0.30   34 **
0.35  211 *********
0.40  816 **********************************
0.45 1431 ************************************************************
0.50 1442 *************************************************************
0.55  809 **********************************
0.60  219 **********
0.65   33 **
0.70    2 *

If I do the same random-data experiment and force a prob of 0.99, the mean
rises to 0.52; if I force a prob of 0.01, it falls to 0.48.  If there's a
bias, it's hiding pretty well <wink>.

If there is a spamprob near 0, it's very much the intent that S take that
seriously, and if one near 1, that H take that seriously; else, as now, I
see screaming spam or screaming ham barely cracking scores above 70 or below
30.  "Too much ends up in the middle" otherwise.
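For concreteness, here's a small sketch of the combining under discussion, under my reading of it: S is the geometric mean of the p_i and H the geometric mean of the 1-p_i (the actual code may differ in detail).  Thrown at random uniform data, the score S/(S+H) centers on 0.5:

```python
import math
import random

def geo_mean(xs):
    """Geometric mean computed via logs, to avoid underflow on long products."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def combine(probs):
    """Score = S/(S+H).  S is dragged down by probs near 0 (ham evidence),
    H is dragged down by probs near 1 (spam evidence)."""
    S = geo_mean(probs)
    H = geo_mean([1.0 - p for p in probs])
    return S / (S + H)

random.seed(42)
# 5000 trials of 50 random spamprobs each, as in the experiment above.
scores = [combine([random.uniform(1e-9, 1.0 - 1e-9) for _ in range(50)])
          for _ in range(5000)]
scores.sort()
mean = sum(scores) / len(scores)
median = scores[len(scores) // 2]
print(mean, median)  # both land very close to 0.5
```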

> If you want to try something like that, I would suggest using the
> ARITHMETIC means in computing S and H and again using S(S+H).  That
> would remove that bias.

That doesn't appear promising:

If
   S = Smean = (sum p_i)/n

and
   H = Hmean = (sum 1-p_i)/n

then Hmean = n/n - Smean = 1 - Smean, and Smean + Hmean = 1.  So whether you
meant S*(S+H) or S/(S+H), the result is S.  To within roundoff error, that's
what happens, too.
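The identity is easy to check numerically: for any probability vector the two arithmetic means sum to 1 up to roundoff, so S/(S+H) (or S*(S+H)) just reproduces S:

```python
probs = [0.99, 0.7, 0.5, 0.2, 0.01]
n = len(probs)
S = sum(probs) / n                    # arithmetic mean of the p_i
H = sum(1.0 - p for p in probs) / n   # arithmetic mean of the 1 - p_i
print(S + H)          # ~1.0, so S/(S+H) and S*(S+H) both collapse to S
```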

> It wouldn't be invoking that optimality theorem, but whatever works...

I'm not sure the optimality theorem in question is relevant to the task at
hand, though.  Why should we care about rejecting a hypothesis that the word
probabilities are uniformly distributed?  There's virtually no message in
which they are, and no reason to believe that the *majority* of words in
spam will have spamprobs over 0.5.  Graham got results as good as he did
because the spamprob strength of a mere handful of words is usually enough
to decide it.  In a sense, I am trying to move back toward what worked best
in his formulation.

> It really seems, as a matter of being educated, that the
> arithmetic approach is worth trying if it doesn't take a lot of
> trouble to try it.

Nope, no trouble, but my test data can't demonstrate improvements, just
disasters.  On a brief 10-fold cv run with 100 ham + 100 spam in each set,
using the arithmetic spamprob mean gave results pretty much the same as the
default scheme; error rates were the same, but the best range for
spam_cutoff shifted from 0.52 thru 0.54, to 0.56 thru 0.58; it increased the
spread a little:

ham mean and sdev for all runs
  30.35   30.53   +0.59%        5.83    5.91   +1.37%

spam mean and sdev for all runs
  80.97   84.08   +3.84%        7.07    6.38   -9.76%

ham/spam mean difference: 50.62 53.55 +2.93

>> "but more sensitive to overwhelming amounts of evidence than
>> Gary-combining"

> From the email you sent at 1:02PM yesterday:
>
> 0.40    0
> 0.45    2 *
> 0.50  412 *********
> 0.55 3068 *************************************************************
> 0.60 1447 *****************************
> 0.65   71 **
> 0.70    0
>
> One thing I'd like to be more clear on. If I understand the experiment
> correctly you set 10 to .99 and 40 were random.

I have to dig up that email to find the context ... OK, this one was tagged

    Result for random vectors of 50 probs, + 10 forced to 0.99

That means there were 60 probs in all, 50 drawn from (0.0, 1.0), + 10 of
0.99.

> What percentage actually ended up as > .5, without regard to
> HOW MUCH over .5?

From the histogram, all but 2, out of 5000 trials.  0.5 doesn't work as a
spam_cutoff on anyone's corpus here, though (it's too low; too many false
positives).  The median value in that run was 0.58555, which is close to
what some people have been using for spam_cutoff.

Under the S/(S+H) scheme, the same experiment yields

5000 items; mean 0.68; sdev 0.05
-> <stat> min 0.490773; median 0.683328; max 0.819528
* = 34 items
0.45    2 *
0.50   27 *
0.55  171 ******
0.60  991 ******************************
0.65 2016 ************************************************************
0.70 1510 *********************************************
0.75  275 *********
0.80    8 *

So if the percentage above 0.5 is the sole measure of goodness here, S/(S+H)
did equally well in this experiment.
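That experiment is easy to replay.  Under my reading of the S/(S+H) scheme (S and H the geometric means of the p_i and 1-p_i), 50 uniform random probs plus 10 forced to 0.99 per trial put the median in the high 0.6s, in line with the histogram above:

```python
import math
import random

def combine(probs):
    """S/(S+H) with S, H the geometric means of p_i and 1-p_i (assumed form)."""
    n = len(probs)
    S = math.exp(sum(math.log(p) for p in probs) / n)
    H = math.exp(sum(math.log(1.0 - p) for p in probs) / n)
    return S / (S + H)

random.seed(0)
scores = []
for _ in range(5000):
    # 50 random probs + 10 forced to 0.99, 60 probs in all.
    probs = [random.uniform(1e-9, 1.0 - 1e-9) for _ in range(50)] + [0.99] * 10
    scores.append(combine(probs))
scores.sort()
median = scores[len(scores) // 2]
above_half = sum(1 for s in scores if s > 0.5) / len(scores)
print(median, above_half)  # median well above 0.5; almost every trial > 0.5
```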

> ...
> It's not the (S-H)/(S+H) that is the most sensitive (under certain
> conditions), it's that the geometric mean approach for computing S gives a
> result that is MONOTONIC WITH a calculation which is the most sensitive.
>
> The real technique would take S and feed it into an inverse chi-square
> function with (in this experiment) 100 degrees of freedom. The output
> (roughly speaking) would be the probability that that S (or a more extreme
> one) might have occurred by chance alone.
>
> Call these numbers S' and H' for S and H respectively.
>
> The calculation (S-H)/(S+H) will be > 0 if and only if (S'-H')/(S'+H') is
> (unless I've made some error).
>
> So, as a binary indicator, the two are equivalent. However, if you used S'
> and H', you would see something more like real probabilities that would
> probably be of magnitudes that would be more attractive to you.
>
> You could probably use a table to approximate the inverse chi-square calc
> rather than actually doing the computations all the time.
>
> I didn't suggest doing that, at first, because I was interested
> in providing a binary indicator and wanting to keep things simple --
> and from the POV of a binary indicator, it doesn't make any difference.

It's not a question of attraction <wink> so much as that this "binary
indicator" doesn't come with a decision rule for knowing which outcome is
which:  it varies across corpus, and within a given corpus varies over time,
depending on how much data has been trained on.  So we get a stream of test
results where the numbers have to be fudged retroactively via "but if I had
set the cutoff to *this* on this run, the results would have been very
different".  It's just too delicate as is.

> So, if it happens that you feel like taking the time to go "all the way"
> with this approach, I would suggest actually computing S' and H' and
> seeing what happens.

Sounds like fun.
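For the record, here's a sketch of what computing S' and H' might look like.  The chi2Q survival function below is exact for even degrees of freedom; the (S' - H' + 1)/2 scaling at the end is my own assumption about how to fold the two tails into one score, not something specified above:

```python
import math

def chi2Q(x2, v):
    """P(chisq >= x2) for a chi-squared variable with v (even) degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = s = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        s += term
    return min(s, 1.0)

def chi_combine(probs):
    """Feed the log-products through inverse chi-square: under a uniform null,
    -2*sum(ln p_i) is chi-squared with 2n degrees of freedom."""
    n = len(probs)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0  # assumed scaling into [0, 1]

print(chi_combine([0.99] * 10))  # near 1: strong spam evidence
print(chi_combine([0.01] * 10))  # near 0: strong ham evidence
print(chi_combine([0.5] * 10))   # 0.5 by symmetry
```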

> I think you would like the results better -- I just didn't suggest
> it at first because I didn't know the spread would be of such
> interest and I wanted to keep things simple.

That's fine.  In practice, the touchiness of spam_cutoff has been an ongoing
practical problem; but it's been the *only* ongoing problem, so that's why
we're talking about it <wink>.

> I think this would work better than the S/(S+H) approach, because
> if you use geometric means, it's more sensitive to one condition than
> the other, and if you use arithmetic means, you don't invoke the
> optimality theorem.

As above, I've found no reason yet to believe S/(S+H) favors one side over
the other, and the test runs didn't show me evidence of that either.
Indeed, it made the same mistakes on the same messages, but moved mounds of
correctly classified messages out of "the middle ground".

> Of course, this is ALL speculative. But the probabilities involved will
> DEFINITELY be of greater magnitude, and so a better-defined spread, if
> the inverse chi-square is used.

It's doable, but the experimental results so far are promising enough that
I'm still keener to see how it works for others here.