[Spambayes] RE: Central Limit Theorem??!! :)
Gary Robinson
grobinson@transpose.com
Tue, 24 Sep 2002 07:31:20 -0400
> All of Gary's schemes beat the fiddled Graham scheme on fp rate here, but
> the Graham scheme remains unbeaten on the fn rate. (I should add, of
> course, "as implemented by Tim" to all of those.)
Summarizing all the test results, I get
fn fp
f(w): 11 6
Graham: 16 10
lcm: 29 3
where f(w) is pure f(w) with a=0.1, 1500 discrimators, and
robinson_minimum_prob_strength=0.1, Graham is tweaked Graham, and lcm is
log-central-mean ignoring the middling values.
So doesn't the f(w) technique beat Graham all-around? At least in this
latest round of testing??? Or am I misreading the results?
I wonder if the errors for lcm are mostly in the region where there are a
small number of data points, such that the central mean theorem isn't really
kicking in.
That is, it may be that there is some number n for which f(w) gives the best
results if the number of non-middling words is < n and and lcm gives the
best results if the number of non-middling words is > n.
That WOULD make a lot of theoretical sense, because for small enough n, the
central mean theorem is meaningless and can only make trouble for us.
Something else for you to test in your copious free time. ;)
--Gary
--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
> From: Tim Peters <tim.one@comcast.net>
> Date: Mon, 23 Sep 2002 23:22:13 -0400
> To: Gary Robinson <grobinson@transpose.com>
> Cc: SpamBayes <spambayes@python.org>, Greg Louis <glouis@dynamicro.on.ca>
> Subject: RE: [Spambayes] RE: Central Limit Theorem??!! :)
>
> [Gary Robinson, outlines another central limit scheme, but using
> ln(f(w)) for spam scores and ln(1-f(w)) for ham scores when
> building the ham and spam "extreme word" populations, then computing
> two corresponding log sums when scoring.
> ]
>
>> ...
>> This *should* bring us the benefits of the multiplicative approach
>> and of the parametric stats approach at the same time.
>
> My first "real test" says it's a winner! See below.
>
>> The downside of this is that the ln's make such a skewed distribution
>> that it should take a bigger n to make the central limit theorem kick
>> in.
>
> This isn't a good time to mention it <wink>, but the original central-limit
> code worked better when I reduced the # of extremes from 30 to 15. Note
> that these are the kinds of oddities I spent countless hours "tuning away"
> on the Graham scheme, though, and I;ve still done almost nothing of that
> nature for your schemes.
>
>> BUT, OTOH, it WILL kick in, and something like 30 may really still
>> be enough (it usually kicks in at significantly smaller numbers).
>
> I predict more experiments on the horizon.
>
>> And you've also successfully used n=150 with f(w)
>
> I've also used n=15 and n=1500, BTW. 150->1500 neither helped nor hurt.
> 150->15 gave mixed results, too close to call across the few small tests I
> ran.
>
>> and that is DEFINITELY enough.
>>
>> The other downside is that it just seems like a bit of a wild thing
> < to do and sometimes when you do wild things, strange reasons emerge
>> why they won't work.
>
> It's an extreme problem, though. I ran experiments long ago training on a
> single ham and a single ham: that was enough to get about 30% fp rate and
> about 20% fn rate. *Much* better than flipping a coin! Training on two of
> each did even better <wink>.
>
> But I really can't see any at this point as long as n is big enough
>> that the sample means take on a normal distribution.
>>
>> THANKS for doing all the coding work to test this idea!!!! :)
>
> Thank me for the testing, if you have to thank me at all -- the coding is
> fun, the testing is mind-numbing <drool>.
>
> Speaking of which, the 3-pass nature of training under this scheme doesn't
> work well with my cross-validation driver. Instead the test here is on a
> 5x5 test grid, skipping the diagonal. I picked 5 disjoint sets of 1000 ham
> each at random, and 5 disjoint sets of 1000 spam each at random. These were
> then paired into 5 collections, 1 thru 5. First I trained on pair 1, then
> predicted on pairs 2, 3, 4, and 5. Then I trained a new classifier on pair
> 2, and predicted on pairs 1, 3, 4, and 5. Etc. In all we get 20 test runs
> out of this, and each message is predicted 4 times.
>
> In both runs, 30 extremes were used, and f(w) with a=0.1. The original
> central limit code is on the left, and the logarithmic version on the right.
> I haven't done anything to do a more sensible scaling of scores into [0, 1],
> so I'll skip histograms.
>
> -> <stat> tested 1000 hams & 1000 spams against 1000 hams & 1000 spams
> [ditto 39 times]
>
> false positive percentages
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.200 0.200 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.100 0.100 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.100 0.100 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.100 0.000 won -100.00%
> 0.100 0.100 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
>
> won 1 times
> tied 19 times
> lost 0 times
>
> total unique fp went from 3 to 2 won -33.33%
> mean fp % went from 0.03 to 0.025 won -16.67%
>
> false negative percentages
> 2.200 0.800 won -63.64%
> 2.100 0.900 won -57.14%
> 1.700 0.600 won -64.71%
> 1.900 0.400 won -78.95%
> 1.500 1.100 won -26.67%
> 1.900 0.700 won -63.16%
> 1.600 0.600 won -62.50%
> 1.600 0.700 won -56.25%
> 1.400 0.600 won -57.14%
> 1.900 0.400 won -78.95%
> 1.600 0.700 won -56.25%
> 1.600 0.400 won -75.00%
> 1.900 0.600 won -68.42%
> 1.600 0.500 won -68.75%
> 1.300 0.500 won -61.54%
> 1.600 0.300 won -81.25%
> 1.300 0.400 won -69.23%
> 1.900 0.400 won -78.95%
> 1.300 0.400 won -69.23%
> 1.400 0.300 won -78.57%
>
> won 20 times
> tied 0 times
> lost 0 times
>
> total unique fn went from 125 to 54 won -56.80%
> mean fn % went from 1.665 to 0.565 won -66.07%
>
> So the logarithmic version was indeed a huge and pure win for the fn rate!
>
> There are two other comparisons I need to make here: against your webpage
> scheme (f(w) with "S"), and against our highly fiddled Graham scheme.
>
> First the same thing (same test msgs, etc), with log central-limit on the
> left, and f(w) with a=0.1, 1500 discrimators, and
> robinson_minimum_prob_strength=0.1 (a gimmick that simply ignores all words
> with spamprob in (0.4, 0.6)) on the right:
>
> [Classifier]
> use_robinson_probability: True
> use_robinson_combining: True
> max_discriminators: 1500
> robinson_minimum_prob_strength: 0.1
> robinson_probability_a: 0.1
> use_central_limit: False
>
> [TestDriver]
> spam_cutoff: 0.575
>
> false positive percentages
> 0.000 0.000 tied
> 0.000 0.100 lost +(was 0)
> 0.000 0.000 tied
> 0.200 0.200 tied
> 0.000 0.000 tied
> 0.000 0.100 lost +(was 0)
> 0.000 0.000 tied
> 0.100 0.300 lost +200.00%
> 0.000 0.000 tied
> 0.000 0.100 lost +(was 0)
> 0.000 0.000 tied
> 0.100 0.200 lost +100.00%
> 0.000 0.000 tied
> 0.000 0.100 lost +(was 0)
> 0.000 0.100 lost +(was 0)
> 0.100 0.200 lost +100.00%
> 0.000 0.000 tied
> 0.000 0.100 lost +(was 0)
> 0.000 0.100 lost +(was 0)
> 0.000 0.000 tied
>
> won 0 times
> tied 10 times
> lost 10 times
>
> total unique fp went from 2 to 6 lost +200.00%
> mean fp % went from 0.025 to 0.08 lost +220.00%
>
> false negative percentages
> 0.800 0.100 won -87.50%
> 0.900 0.000 won -100.00%
> 0.600 0.200 won -66.67%
> 0.400 0.200 won -50.00%
> 1.100 0.100 won -90.91%
> 0.700 0.100 won -85.71%
> 0.600 0.200 won -66.67%
> 0.700 0.100 won -85.71%
> 0.600 0.000 won -100.00%
> 0.400 0.000 won -100.00%
> 0.700 0.200 won -71.43%
> 0.400 0.200 won -50.00%
> 0.600 0.100 won -83.33%
> 0.500 0.100 won -80.00%
> 0.500 0.000 won -100.00%
> 0.300 0.100 won -66.67%
> 0.400 0.100 won -75.00%
> 0.400 0.100 won -75.00%
> 0.400 0.000 won -100.00%
> 0.300 0.300 tied
>
> won 19 times
> tied 1 times
> lost 0 times
>
> total unique fn went from 54 to 11 won -79.63%
> mean fn % went from 0.565 to 0.11 won -80.53%
>
> Alas, it turns out it would have been better to set spam_cutoff to 0.6 on
> this run. That would have cut 9 instances of fp and added 33 instances of
> fn (note that since each msg is scored 4 times, it's *not* necessarily the
> case that these deltas would reflect directly in the "total unique" counts),
> leaving the two roughly comparable on this test.
>
> I predict (but haven't yet tried it) that I'd get an improvement in the
> central-limit scheme by systematically ignoring low-value words too.
>
> Finally, the same thing with log central limit on the left, and our Graham
> scheme on the right (16 discriminators, spam cutoff 0.90).
>
> false positive percentages
> 0.000 0.000 tied
> 0.000 0.300 lost +(was 0)
> 0.000 0.000 tied
> 0.200 0.200 tied
> 0.000 0.000 tied
> 0.000 0.100 lost +(was 0)
> 0.000 0.000 tied
> 0.100 0.200 lost +100.00%
> 0.000 0.000 tied
> 0.000 0.100 lost +(was 0)
> 0.000 0.000 tied
> 0.100 0.200 lost +100.00%
> 0.000 0.100 lost +(was 0)
> 0.000 0.100 lost +(was 0)
> 0.000 0.100 lost +(was 0)
> 0.100 0.300 lost +200.00%
> 0.000 0.100 lost +(was 0)
> 0.000 0.000 tied
> 0.000 0.300 lost +(was 0)
> 0.000 0.000 tied
>
> won 0 times
> tied 9 times
> lost 11 times
>
> total unique fp went from 2 to 10 lost +400.00%
> mean fp % went from 0.025 to 0.105 lost +320.00%
>
> false negative percentages
> 0.800 0.100 won -87.50%
> 0.900 0.100 won -88.89%
> 0.600 0.300 won -50.00%
> 0.400 0.200 won -50.00%
> 1.100 0.300 won -72.73%
> 0.700 0.400 won -42.86%
> 0.600 0.300 won -50.00%
> 0.700 0.100 won -85.71%
> 0.600 0.100 won -83.33%
> 0.400 0.000 won -100.00%
> 0.700 0.200 won -71.43%
> 0.400 0.200 won -50.00%
> 0.600 0.100 won -83.33%
> 0.500 0.100 won -80.00%
> 0.500 0.100 won -80.00%
> 0.300 0.200 won -33.33%
> 0.400 0.100 won -75.00%
> 0.400 0.100 won -75.00%
> 0.400 0.200 won -50.00%
> 0.300 0.200 won -33.33%
>
> won 20 times
> tied 0 times
> lost 0 times
>
> total unique fn went from 54 to 16 won -70.37%
> mean fn % went from 0.565 to 0.17 won -69.91%
>
> All of Gary's schemes beat the fiddled Graham scheme on fp rate here, but
> the Graham scheme remains unbeaten on the fn rate. (I should add, of
> course, "as implemented by Tim" to all of those.)
>