[Spambayes] RE: Central Limit Theorem??!! :)

Gary Robinson grobinson@transpose.com
Tue, 24 Sep 2002 07:31:20 -0400


> All of Gary's schemes beat the fiddled Graham scheme on fp rate here, but
> the Graham scheme remains unbeaten on the fn rate.  (I should add, of
> course, "as implemented by Tim" to all of those.)

Summarizing all the test results, I get

         fn  fp
f(w):    11  6
Graham:  16  10
lcm:     29  3

where f(w) is pure f(w) with a=0.1, 1500 discrimators, and
robinson_minimum_prob_strength=0.1, Graham is tweaked Graham, and lcm is
log-central-mean ignoring the middling values.

So doesn't the f(w) technique beat Graham all-around? At least in this
latest round of testing??? Or am I misreading the  results?

I wonder if the errors for lcm are mostly in the region where there are a
small number of data points, such that the central mean theorem isn't really
kicking in.

That is, it may be that there is some number n for which f(w) gives the best
results if the number of non-middling words is < n and and lcm gives the
best results if the number of non-middling words is > n.

That WOULD make a lot of theoretical sense, because for small enough n, the
central mean theorem is meaningless and can only make trouble for us.

Something else for you to test in your copious free time. ;)


--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


> From: Tim Peters <tim.one@comcast.net>
> Date: Mon, 23 Sep 2002 23:22:13 -0400
> To: Gary Robinson <grobinson@transpose.com>
> Cc: SpamBayes <spambayes@python.org>, Greg Louis <glouis@dynamicro.on.ca>
> Subject: RE: [Spambayes] RE: Central Limit Theorem??!!     :)
> 
> [Gary Robinson, outlines another central limit scheme, but using
> ln(f(w)) for spam scores and ln(1-f(w)) for ham scores when
> building the ham and spam "extreme word" populations, then computing
> two corresponding log sums when scoring.
> ]
> 
>> ...
>> This *should* bring us the benefits of the multiplicative approach
>> and of the parametric stats approach at the same time.
> 
> My first "real test" says it's a winner!  See below.
> 
>> The downside of this is that the ln's make such a skewed distribution
>> that it should take a bigger n to make the central limit theorem kick
>> in.
> 
> This isn't a good time to mention it <wink>, but the original central-limit
> code worked better when I reduced the # of extremes from 30 to 15.  Note
> that these are the kinds of oddities I spent countless hours "tuning away"
> on the Graham scheme, though, and I;ve still done almost nothing of that
> nature for your schemes.
> 
>> BUT, OTOH, it WILL kick in, and something like 30 may really still
>> be enough (it usually kicks in at significantly smaller numbers).
> 
> I predict more experiments on the horizon.
> 
>> And you've also successfully used n=150 with f(w)
> 
> I've also used n=15 and n=1500, BTW.  150->1500 neither helped nor hurt.
> 150->15 gave mixed results, too close to call across the few small tests I
> ran.
> 
>> and that is DEFINITELY enough.
>> 
>> The other downside is that it just seems like a bit  of a wild thing
> < to do and sometimes when you do wild things, strange reasons emerge
>> why they won't work.
> 
> It's an extreme problem, though.  I ran experiments long ago training on a
> single ham and a single ham:  that was enough to get about 30% fp rate and
> about 20% fn rate.  *Much* better than flipping a coin!  Training on two of
> each did even better <wink>.
> 
> But I really can't see any at this point as long as n is big enough
>> that the sample means take on a normal distribution.
>> 
>> THANKS for doing all the coding work to test this idea!!!!  :)
> 
> Thank me for the testing, if you have to thank me at all -- the coding is
> fun, the testing is mind-numbing <drool>.
> 
> Speaking of which, the 3-pass nature of training under this scheme doesn't
> work well with my cross-validation driver.  Instead the test here is on a
> 5x5 test grid, skipping the diagonal.  I picked 5 disjoint sets of 1000 ham
> each at random, and 5 disjoint sets of 1000 spam each at random.  These were
> then paired into 5 collections, 1 thru 5.  First I trained on pair 1, then
> predicted on pairs 2, 3, 4, and 5.  Then I trained a new classifier on pair
> 2, and predicted on pairs 1, 3, 4, and 5.  Etc.  In all we get 20 test runs
> out of this, and each message is predicted 4 times.
> 
> In both runs, 30 extremes were used, and f(w) with a=0.1.  The original
> central limit code is on the left, and the logarithmic version on the right.
> I haven't done anything to do a more sensible scaling of scores into [0, 1],
> so I'll skip histograms.
> 
> -> <stat> tested 1000 hams & 1000 spams against 1000 hams & 1000 spams
>  [ditto 39 times]
> 
> false positive percentages
>   0.000  0.000  tied
>   0.000  0.000  tied
>   0.000  0.000  tied
>   0.200  0.200  tied
>   0.000  0.000  tied
>   0.000  0.000  tied
>   0.000  0.000  tied
>   0.100  0.100  tied
>   0.000  0.000  tied
>   0.000  0.000  tied
>   0.000  0.000  tied
>   0.100  0.100  tied
>   0.000  0.000  tied
>   0.000  0.000  tied
>   0.100  0.000  won   -100.00%
>   0.100  0.100  tied
>   0.000  0.000  tied
>   0.000  0.000  tied
>   0.000  0.000  tied
>   0.000  0.000  tied
> 
> won   1 times
> tied 19 times
> lost  0 times
> 
> total unique fp went from 3 to 2 won    -33.33%
> mean fp % went from 0.03 to 0.025 won    -16.67%
> 
> false negative percentages
>   2.200  0.800  won    -63.64%
>   2.100  0.900  won    -57.14%
>   1.700  0.600  won    -64.71%
>   1.900  0.400  won    -78.95%
>   1.500  1.100  won    -26.67%
>   1.900  0.700  won    -63.16%
>   1.600  0.600  won    -62.50%
>   1.600  0.700  won    -56.25%
>   1.400  0.600  won    -57.14%
>   1.900  0.400  won    -78.95%
>   1.600  0.700  won    -56.25%
>   1.600  0.400  won    -75.00%
>   1.900  0.600  won    -68.42%
>   1.600  0.500  won    -68.75%
>   1.300  0.500  won    -61.54%
>   1.600  0.300  won    -81.25%
>   1.300  0.400  won    -69.23%
>   1.900  0.400  won    -78.95%
>   1.300  0.400  won    -69.23%
>   1.400  0.300  won    -78.57%
> 
> won  20 times
> tied  0 times
> lost  0 times
> 
> total unique fn went from 125 to 54 won    -56.80%
> mean fn % went from 1.665 to 0.565 won    -66.07%
> 
> So the logarithmic version was indeed a huge and pure win for the fn rate!
> 
> There are two other comparisons I need to make here:  against your webpage
> scheme (f(w) with "S"), and against our highly fiddled Graham scheme.
> 
> First the same thing (same test msgs, etc), with log central-limit on the
> left, and f(w) with a=0.1, 1500 discrimators, and
> robinson_minimum_prob_strength=0.1 (a gimmick that simply ignores all words
> with spamprob in (0.4, 0.6)) on the right:
> 
> [Classifier]
> use_robinson_probability: True
> use_robinson_combining: True
> max_discriminators: 1500
> robinson_minimum_prob_strength: 0.1
> robinson_probability_a: 0.1
> use_central_limit: False
> 
> [TestDriver]
> spam_cutoff: 0.575
> 
> false positive percentages
>   0.000  0.000  tied
>   0.000  0.100  lost  +(was 0)
>   0.000  0.000  tied
>   0.200  0.200  tied
>   0.000  0.000  tied
>   0.000  0.100  lost  +(was 0)
>   0.000  0.000  tied
>   0.100  0.300  lost  +200.00%
>   0.000  0.000  tied
>   0.000  0.100  lost  +(was 0)
>   0.000  0.000  tied
>   0.100  0.200  lost  +100.00%
>   0.000  0.000  tied
>   0.000  0.100  lost  +(was 0)
>   0.000  0.100  lost  +(was 0)
>   0.100  0.200  lost  +100.00%
>   0.000  0.000  tied
>   0.000  0.100  lost  +(was 0)
>   0.000  0.100  lost  +(was 0)
>   0.000  0.000  tied
> 
> won   0 times
> tied 10 times
> lost 10 times
> 
> total unique fp went from 2 to 6 lost  +200.00%
> mean fp % went from 0.025 to 0.08 lost  +220.00%
> 
> false negative percentages
>   0.800  0.100  won    -87.50%
>   0.900  0.000  won   -100.00%
>   0.600  0.200  won    -66.67%
>   0.400  0.200  won    -50.00%
>   1.100  0.100  won    -90.91%
>   0.700  0.100  won    -85.71%
>   0.600  0.200  won    -66.67%
>   0.700  0.100  won    -85.71%
>   0.600  0.000  won   -100.00%
>   0.400  0.000  won   -100.00%
>   0.700  0.200  won    -71.43%
>   0.400  0.200  won    -50.00%
>   0.600  0.100  won    -83.33%
>   0.500  0.100  won    -80.00%
>   0.500  0.000  won   -100.00%
>   0.300  0.100  won    -66.67%
>   0.400  0.100  won    -75.00%
>   0.400  0.100  won    -75.00%
>   0.400  0.000  won   -100.00%
>   0.300  0.300  tied
> 
> won  19 times
> tied  1 times
> lost  0 times
> 
> total unique fn went from 54 to 11 won    -79.63%
> mean fn % went from 0.565 to 0.11 won    -80.53%
> 
> Alas, it turns out it would have been better to set spam_cutoff to 0.6 on
> this run.  That would have cut 9 instances of fp and added 33 instances of
> fn (note that since each msg is scored 4 times, it's *not* necessarily the
> case that these deltas would reflect directly in the "total unique" counts),
> leaving the two roughly comparable on this test.
> 
> I predict (but haven't yet tried it) that I'd get an improvement in the
> central-limit scheme by systematically ignoring low-value words too.
> 
> Finally, the same thing with log central limit on the left, and our Graham
> scheme on the right (16 discriminators, spam cutoff 0.90).
> 
> false positive percentages
>   0.000  0.000  tied
>   0.000  0.300  lost  +(was 0)
>   0.000  0.000  tied
>   0.200  0.200  tied
>   0.000  0.000  tied
>   0.000  0.100  lost  +(was 0)
>   0.000  0.000  tied
>   0.100  0.200  lost  +100.00%
>   0.000  0.000  tied
>   0.000  0.100  lost  +(was 0)
>   0.000  0.000  tied
>   0.100  0.200  lost  +100.00%
>   0.000  0.100  lost  +(was 0)
>   0.000  0.100  lost  +(was 0)
>   0.000  0.100  lost  +(was 0)
>   0.100  0.300  lost  +200.00%
>   0.000  0.100  lost  +(was 0)
>   0.000  0.000  tied
>   0.000  0.300  lost  +(was 0)
>   0.000  0.000  tied
> 
> won   0 times
> tied  9 times
> lost 11 times
> 
> total unique fp went from 2 to 10 lost  +400.00%
> mean fp % went from 0.025 to 0.105 lost  +320.00%
> 
> false negative percentages
>   0.800  0.100  won    -87.50%
>   0.900  0.100  won    -88.89%
>   0.600  0.300  won    -50.00%
>   0.400  0.200  won    -50.00%
>   1.100  0.300  won    -72.73%
>   0.700  0.400  won    -42.86%
>   0.600  0.300  won    -50.00%
>   0.700  0.100  won    -85.71%
>   0.600  0.100  won    -83.33%
>   0.400  0.000  won   -100.00%
>   0.700  0.200  won    -71.43%
>   0.400  0.200  won    -50.00%
>   0.600  0.100  won    -83.33%
>   0.500  0.100  won    -80.00%
>   0.500  0.100  won    -80.00%
>   0.300  0.200  won    -33.33%
>   0.400  0.100  won    -75.00%
>   0.400  0.100  won    -75.00%
>   0.400  0.200  won    -50.00%
>   0.300  0.200  won    -33.33%
> 
> won  20 times
> tied  0 times
> lost  0 times
> 
> total unique fn went from 54 to 16 won    -70.37%
> mean fn % went from 0.565 to 0.17 won    -69.91%
> 
> All of Gary's schemes beat the fiddled Graham scheme on fp rate here, but
> the Graham scheme remains unbeaten on the fn rate.  (I should add, of
> course, "as implemented by Tim" to all of those.)
>