[Spambayes] RE: Central Limit Theorem??!! :)

Tim Peters tim.one@comcast.net
Tue, 24 Sep 2002 16:38:11 -0400


[Gary Robinson]
> Summarizing all the test results, I get
>
>          fn  fp
> f(w):    11  6
> Graham:  16  10
> lcm:     29  3
>
> where f(w) is pure f(w) with a=0.1, 1500 discriminators, and
> robinson_minimum_prob_strength=0.1, Graham is tweaked Graham, and lcm is
> log-central-mean ignoring the middling values.

Good summary.  Note that about the first (f(w)) result, I said:

>> Alas, it turns out it would have been better to set spam_cutoff to 0.6
>> on this run.  That would have cut 9 instances of fp and added 33
>> instances of fn (note that since each msg is scored 4 times, it's *not*
>> necessarily the case that these deltas would reflect directly in the
>> "total unique" counts), leaving the two roughly comparable on this test.

I don't know what the "total unique" counts would have been then.  The
biggest practical problem we've got with the f(w) scheme is that each person
who tests it seems to need a slightly different value of spam_cutoff to work
best (all reported "best values" have been in [0.5, 0.6] to date, and a
change of 0.01 is sometimes enough to change a "looks like it lost" outcome
to a "looks like it won" outcome).

> So doesn't the f(w) technique beat Graham all-around? At least in this
> latest round of testing??? Or am I misreading the results?

Leaving spam_cutoff at 0.575 in the f(w) run did beat both Graham numbers on
this, yes.  I wanted to boost it to 0.6 to combat the f-p rate regression
f(w) showed compared to the lcm scheme.  The absolute f(w)-vs-Graham
difference is only 4 (10-6) or 5 (16-11) messages out of 20000 predictions,
though -- a few hundredths of a percent -- so I call those two approximately
equal: the delta is tiny compared to the total number of predictions made.

> I wonder if the errors for lcm are mostly in the region where there are a
> small number of data points, such that the central limit theorem
> isn't really kicking in.
>
> That is, it may be that there is some number n for which f(w)
> gives the best results if the number of non-middling words is < n and
> lcm gives the best results if the number of non-middling words is > n.
>
> That WOULD make a lot of theoretical sense, because for small
> enough n, the central limit theorem is meaningless and can only make
> trouble for us.
>
> Something else for you to test in your copious free time. ;)
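
For what it's worth, the test Gary is suggesting amounts to bucketing each
prediction by its number of non-middling words and comparing the schemes
per bucket.  A rough sketch (hypothetical names; it assumes per-prediction
error flags have already been pulled out of the test listings):

    def bucket_errors(per_msg, bucket_size=10):
        # per_msg: sequence of (n_nonmiddling, fw_wrong, lcm_wrong) tuples,
        # one per prediction; the *_wrong fields are 0/1 (or False/True).
        # Returns a dict mapping bucket index -> (f(w) errors, lcm errors).
        buckets = {}
        for n, fw_wrong, lcm_wrong in per_msg:
            key = n // bucket_size
            fw_errs, lcm_errs = buckets.get(key, (0, 0))
            buckets[key] = (fw_errs + fw_wrong, lcm_errs + lcm_wrong)
        return buckets

If there's a crossover n of the kind Gary describes, it should show up as
f(w) winning in the low buckets and lcm winning in the high ones.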

I have complete listings of all misclassifications for all tests reported,
but they're not available to me here and now.  I'll check it out later.  The
historical problems I had with the f-n rate under the Graham scheme involved
very long and very short msgs.  One effect of ignoring middling words is
actually to decrease n on short msgs, and sometimes to decrease n a lot.
Whatever damage is done to the central limit theorem then seems more than
compensated for by not sucking in words that try to drag the z-values 15
stddevs away from both the ham and spam means.  Only words with conviction
should vote <wink>.
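
To spell out what "ignoring middling words" does to n, here's a rough
sketch of the filtering step.  It assumes f(w) has the form
f(w) = (a*0.5 + count*p(w)) / (a + count), where count is how often the
word has been seen and a=0.1 as above shrinks p(w) toward 0.5; the helper
names are invented, the exact parameterization used in the runs above may
differ, and it glosses over how the raw p(w) counts are computed:

    A = 0.1                  # the "a" strength given to the 0.5 prior
    X = 0.5                  # probability assumed for a never-seen word
    MIN_PROB_STRENGTH = 0.1  # robinson_minimum_prob_strength above

    def fw(raw_spamprob, times_seen):
        # Shrink the raw per-word spamprob toward 0.5, weighted by how much
        # evidence (times_seen) there actually is for the word.
        return (A * X + times_seen * raw_spamprob) / (A + times_seen)

    def non_middling_words(msg_words, worddb):
        # worddb maps word -> (raw spamprob, times seen); unknown words get
        # (X, 0), so their f(w) is exactly 0.5 and they're dropped.
        kept = []
        for word in msg_words:
            raw, seen = worddb.get(word, (X, 0))
            prob = fw(raw, seen)
            if abs(prob - 0.5) >= MIN_PROB_STRENGTH:
                kept.append((word, prob))
        return kept

On a short msg most words can fail the |f(w) - 0.5| >= 0.1 test, so the n
that survives can be much smaller than the raw word count -- which is the
effect described above.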