[Spambayes] Central limit

Tim Peters tim.one@comcast.net
Mon, 30 Sep 2002 12:38:30 -0400


[Rob Hooft]
>   - The standard deviations seem "underestimated". Gary already said
>     this can be caused by correlations between scores. Alternatively
>     this can indicate that the data is not 1D: in more than one
>     dimension, a higher percentage of normally distributed data lies
>     outside of the "core regions". Anyway, something can be done about
>     this: just calculate the RMS Z-score, and scale it to 1.0.

Sorry, I don't know what that means or how to compute it; neither does
google <wink>.  Let's say this is my population:  {2, 5, 10, 64}.  Then what
are the "RMS Z-score scaled to 1.0" thingies of 1, 2, 32, 64, and 1000?
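Here's my best guess at a concrete reading, so you can tell me where I'm
wrong:  compute each point's Z-score against the population's own mean and
sdev, then divide by the RMS of the population's Z-scores, so that the
rescaled population has RMS Z-score exactly 1.0.  The code is entirely my
invention, not anything Rob specified:

```python
import math

def rms_z_rescale(population, xs):
    # One *guess* at "RMS Z-score scaled to 1.0":  Z-score each x against
    # the population mean/sdev, then divide by the RMS of the population's
    # own Z-scores so that RMS becomes 1.0.
    n = len(population)
    mean = sum(population) / n
    # sample variance (n-1 in the denominator)
    var = sum((p - mean) ** 2 for p in population) / (n - 1)
    sdev = math.sqrt(var)
    pop_z = [(p - mean) / sdev for p in population]
    rms = math.sqrt(sum(z * z for z in pop_z) / n)
    return [((x - mean) / sdev) / rms for x in xs]

print(rms_z_rescale([2, 5, 10, 64], [1, 2, 32, 64, 1000]))
```

If that's not it, ignore the example <wink>.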

> ...
>   - The "certainty" rule of Tim should be formalized.

Sure, but how?  I made up a combination of "look at ratios" and "different
cutoffs for different n" by iteratively staring at the errors and making
stuff up.  Even then all I get is a binary "certain or uncertain?" decision
out of it, and without a clear connection to quantifiable probabilities I
don't have strong reason to believe it's a sensible approach in general.

An alternative I haven't tried:  Consider the populations to be the set of
all ham-scores and spam-scores of msgs as a whole (rather than as funky
collections of individual "extreme word" probabilities).  I expect that has
a much better shot at being normally distributed; I don't know whether the
ham and spam populations so obtained would be separated well enough to be
useful, though.
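For anyone who wants to poke at that, here's a sketch of what I mean by
"separated well enough":  treat the per-msg ham scores and spam scores as
two populations, and measure the gap between their means in pooled-sdev
units.  The helper is made up for illustration, and pooled-sdev units are
just one plausible yardstick:

```python
import math

def separation(ham_scores, spam_scores):
    # Gap between the two population means, measured in pooled sample
    # standard deviations.  Bigger is better for classification.
    def mean_sdev(xs):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
        return m, math.sqrt(v)
    hmean, hsdev = mean_sdev(ham_scores)
    smean, ssdev = mean_sdev(spam_scores)
    pooled = math.sqrt((hsdev ** 2 + ssdev ** 2) / 2)
    return abs(smean - hmean) / pooled
```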

> ...
>     Once this has crystallized a bit more we should exchange pickles
>     and see how well we can do with each-others training data!

One thing I noted before:  when training on more data than what I reported
on most recently, the log-central-limit population ham and spam means and
variances appeared remarkably insensitive to which random subset of msgs I
took to be "the population".  That was a good sign.  If we can figure out
what to *do* with these things <wink>, it's easy for me to whip up a little
program that will just compute these stats and display them; then we can
directly find out how sensitive they are cross-corpus for the people playing
along here.
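FWIW, the little program wouldn't need to be more than something like the
following (the stats are plain sample mean and variance; the
subset-sampling helper is invented here just to show the shape of it):

```python
import math, random

def clt_stats(scores):
    # Sample mean and variance of a list of per-msg ln-score sums.
    n = len(scores)
    m = sum(scores) / n
    return m, sum((s - m) ** 2 for s in scores) / (n - 1)

def subset_stability(scores, k, trials=5, seed=42):
    # Recompute (mean, variance) over several random subsets of size k,
    # to see how much the population stats wobble across subsets.
    rng = random.Random(seed)
    return [clt_stats(rng.sample(scores, k)) for _ in range(trials)]
```

Run it over your own ham and spam ln-score populations and eyeball how
much the (mean, variance) pairs move around.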

>   - It should somehow be possible to classify messages into any
>     number of distinct groups using this trick. A new message can
>     get scored a Z-score to describe the likelihood that it is part
>     of any of the groups; if all of these numbers are large, the test
>     message does not belong to any class. I guess, e.g., that it
>     should not be too difficult for the bayesian algorithms used
>     here to judge whether E-mail I receive is for "work", "private"
>     or "spam".

I don't expect this to work.  N-way classification is a natural task for a
Bayesian classifier, but the only thing Bayesian about the approach here is
Gary's adjustment to the spamprobs as computed by counting.  This is
discussed in detail on Gary's web page, including a link to the tortured
reasoning it takes to find anything Bayesian in Paul's formulation.

WRT Gary's clt ideas, they seem to get huge bang for the buck from the fact
that we're making a binary decision, so that the two statistics of interest for
each word are ln(spamprob) and ln(1-spamprob).  This makes a very sharp
distinction for words with low or high spamprob; it's unclear how to split
this into an N-way distinction, or that doing so would be useful.  I suppose
that for each category C, you could compute ln(Cprob) and ln(1-Cprob).  Then
I wave my hands, and it all works great <wink>.
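For the record, the hand-wave in code -- a toy sketch, not a proposal;
the per-category probability maps, the 0.5 default for unknown words, and
the argmax decision are all invented for illustration:

```python
import math

def nway_score(word_probs_by_cat, msg_words):
    # word_probs_by_cat maps category -> {word: P(category | word)};
    # all hypothetical.  For each category C, sum ln(Cprob) over the
    # msg's words; words a category has never seen score a neutral 0.5.
    # Returns the best-scoring category and all the scores.
    scores = {}
    for cat, probs in word_probs_by_cat.items():
        scores[cat] = sum(math.log(probs.get(w, 0.5)) for w in msg_words)
    return max(scores, key=scores.get), scores
```

Whether the sums stay separated enough across N categories to be useful is
exactly the part I'm waving my hands over.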