[Spambayes] RE: Central Limit Theorem??!! :)

Gary Robinson grobinson@transpose.com
Fri, 27 Sep 2002 06:57:45 -0400


> That brings us to the mistakes, and this seems a *typical* mistake when a
> false negative pops up (there are plenty of words here; that's not the
> problem):
> 
> zham            zspam
> -17.8741370033  -20.4279646914
> 
> The best guess it can make is that it's closer to ham, but let's get real
> about this <wink>:  these are honest-to-God probabilities (well, directly
> related to honest-to-God probabilities), and at 18 sdevs away from the ham
> mean, the system is screaming there's not a chance in hell the msg fits what
> it knows about ham.  It's *also* screaming there's not a chance in hell the
> msg fits what it knows about spam.


I think what you say above makes a lot of sense. Very interesting!
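
For anyone trying to follow the arithmetic at home, my understanding is
that the z-scores are just the textbook quantity.  Here's a minimal
sketch -- the names are mine, not the real classifier code, and the
message score is made up -- using the run-1 population stats from the
bottom of your message:

    import math

    def zscore(x, pop_mean, pop_var):
        # How many standard deviations x sits from the population mean.
        return (x - pop_mean) / math.sqrt(pop_var)

    HAM_MEAN, HAM_VAR = -0.324091120632, 0.555156484853
    SPAM_MEAN, SPAM_VAR = -0.104654537681, 0.121601025545

    x = -13.6  # hypothetical per-message score
    print(zscore(x, HAM_MEAN, HAM_VAR))    # ~ -17.8: "that's no ham"
    print(zscore(x, SPAM_MEAN, SPAM_VAR))  # ~ -38.7: "that's no spam either"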

> The only rational conclusion to draw is that the system is utterly baffled,
> *and knows it*, so should kick such a msg out for manual review.  Not only
> "a middle ground", but a principled middle ground where the system itself
> knows it has no confidence in its decision, because both outcomes are
> astronomically unlikely based on all it knows.


Yes. That sounds right!
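
If we wanted to act on that, the rule could be as simple as this sketch.
The 4-sdev cutoff is a number I pulled out of the air purely for
illustration; it would need tuning:

    def classify(zham, zspam, cutoff=4.0):
        # Call a population plausible only if the msg is within
        # `cutoff` sdevs of its mean; refuse to guess when both
        # (or neither) population finds the msg plausible.
        ham_ok = abs(zham) < cutoff
        spam_ok = abs(zspam) < cutoff
        if ham_ok and not spam_ok:
            return 'ham'
        if spam_ok and not ham_ok:
            return 'spam'
        return 'unsure'  # kick it out for manual review

    print(classify(-17.8741370033, -20.4279646914))  # -> unsure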


--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
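
P.S. To make the hybrid idea from my msg quoted below concrete, this is
roughly what I have in mind.  Every name here is invented, and n_cutoff
would have to be found empirically:

    def hybrid_score(msg, n_cutoff=30):
        words = non_middling_words(msg)  # hypothetical helper
        if len(words) < n_cutoff:
            # Too few data points for the central limit theorem to
            # kick in; fall back on the f(w) combining scheme.
            return fw_score(words)       # hypothetical helper
        return lcm_score(words)          # hypothetical clt-based scorer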


> From: Tim Peters <tim.one@comcast.net>
> Date: Fri, 27 Sep 2002 00:43:17 -0400
> To: Gary Robinson <grobinson@transpose.com>
> Cc: SpamBayes <spambayes@python.org>, Greg Louis <glouis@dynamicro.on.ca>
> Subject: RE: [Spambayes] RE: Central Limit Theorem??!!     :)
> 
> [Gary Robinson]
>> ...
>> I wonder if the errors for lcm are mostly in the region where there
>> are a small number of data points, such that the central limit theorem
>> isn't really kicking in.
>> 
>> That is, it may be that there is some number n for which f(w)
>> gives the best results if the number of non-middling words is < n,
>> and lcm gives the best results if the number of non-middling words
>> is > n.
>> 
>> That WOULD make a lot of theoretical sense, because for small
>> enough n, the central limit theorem is meaningless and can only make
>> trouble for us.
>> 
>> Something else for you to test in your copious free time. ;)
> 
> This may be very cool!  I speculated with Guido about another possibility
> for errors in this approach, and dumping in some instrumentation appears
> to confirm it.
> 
> First, yes, some errors are due to very low n -- like it only finds 8 words
> in an entire msg.  Those are hard to score for any scheme.  But these cases
> usually *also* suffer the same problem I'll eventually get around to
> revealing <wink>.
> 
> Second, here are some *typical* internal z-scores while predicting ham, all
> with at least 30 non-middling words (this is just a slice I took from the
> output, while it was predicting against 6 known ham):
> 
> zham           zspam
> 2.29985206263  -76.3424101961
> 0.187039535126 -60.6685540929
> 0.16058734364  -43.5223790527
> 0.303545599809 -64.5043366748
> 2.32811619768  -80.3108808262
> 2.08243355217  -56.6967511599
> 
> Now if something is 60 spam sdevs away from the spam mean, and 1 ham sdev
> away from the ham mean, extreme confidence is surely justified.  While
> predicting a spam, it's never so extreme, because the population ham
> variance is much larger than the population spam variance, so the z-scores
> away from the ham mean simply can't get as large (btw, I believe this is why
> it has such a pronounced tendency to err on the false negative side); even
> so, extreme confidence is still justified with numbers like these (a typical
> slice when predicting against 6 known spam):
> 
> zham           zspam
> -26.1507326771 -0.680077213248
> -28.3253589669  1.10297422272
> -28.3253589669  1.10297422272
> -28.9332374355  1.31350047503
> -26.5203302612 -0.236968008101
> -37.2333822428 -0.722498689497
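> 
> Concretely, plug the same absolute deviation into both populations
> (toy arithmetic, using the run-1 stats at the end of this msg):
> 
>     import math
>     deviation = -10.0  # hypothetical distance from either mean
>     print(deviation / math.sqrt(0.555156484853))  # ~ -13.4 ham sdevs
>     print(deviation / math.sqrt(0.121601025545))  # ~ -28.7 spam sdevs
> 
> The same distance buys less than half as many sdevs on the ham side.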
> 
> That brings us to the mistakes, and this seems a *typical* mistake when a
> false negative pops up (there are plenty of words here; that's not the
> problem):
> 
> zham            zspam
> -17.8741370033  -20.4279646914
> 
> The best guess it can make is that it's closer to ham, but let's get real
> about this <wink>:  these are honest-to-God probabilities (well, directly
> related to honest-to-God probabilities), and at 18 sdevs away from the ham
> mean, the system is screaming there's not a chance in hell the msg fits what
> it knows about ham.  It's *also* screaming there's not a chance in hell the
> msg fits what it knows about spam.
> 
> The only rational conclusion to draw is that the system is utterly baffled,
> *and knows it*, so should kick such a msg out for manual review.  Not only
> "a middle ground", but a principled middle ground where the system itself
> knows it has no confidence in its decision, because both outcomes are
> astronomically unlikely based on all it knows.
> 
> The messages that fall into this class *are* unusual, too!  I'm still
> staring at two from the last run trying to decide whether they're really ham
> or spam!  A third is one we debated on this list, and it took a google
> search for related msgs to decide it was really spam.  That's extremely
> cool, if the pattern holds:  nothing else has been so certain about its
> uncertainty, and nothing else has pinpointed msgs I'm also uncertain about.
> 
> I can't make more time to pursue this now, but it's very exciting.
> 
> Another thing that *may* be cool:  the central limit approaches are a bitch
> to train over time, because new messages change probabilities, probabilities
> change extremes, and that means whenever you add a message then "in theory"
> you should go back over all the ham and spam you've ever trained on and grab
> what may be new extremes from them (in order to compute new population means
> and vars).
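> 
> In pseudo-Python, the painful part is this rescan -- all the names
> here are invented, and the real bookkeeping is messier:
> 
>     def full_retrain(trained_ham, trained_spam, new_msg, new_is_spam):
>         update_wordinfo(new_msg, new_is_spam)  # hypothetical: moves the p(w)'s
>         # Any p(w) that moved may change which words are a msg's
>         # extremes, so every trained msg has to be rescored:
>         ham_scores = [extreme_score(m) for m in trained_ham]    # hypothetical
>         spam_scores = [extreme_score(m) for m in trained_spam]
>         return mean_var(ham_scores), mean_var(spam_scores)      # hypothetical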
> 
> But here are the ham and spam population means and vars from two runs on
> disjoint random subsets of 2000 ham and 1400 spam (this is the logarithmic
> version):
> 
> Run 1
> hammean  -0.324091120632 hamvar  0.555156484853
> spammean -0.104654537681 spamvar 0.121601025545
> 
> Run 2
> hammean  -0.321945924099 hamvar  0.546761402392
> spammean -0.105809267575 spamvar 0.124020283754
> 
> They're much the same across runs.  This, combined with the extreme imbalance
> in z-scores during "typical" predictions, suggests that it may be possible
> to do this *part* of training only once -- if the population means and
> variances here have any sort of objective meaning <wink>, they're simply not
> going to change much provided they were trained on lots of data to begin
> with.
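> 
> If that holds up, the expensive part could be done just once, on a
> large initial corpus, and frozen -- a sketch, with the same invented
> names as above:
> 
>     # One-time pass over the initial training data:
>     HAM_MEAN, HAM_VAR = mean_var([extreme_score(m) for m in initial_ham])
>     SPAM_MEAN, SPAM_VAR = mean_var([extreme_score(m) for m in initial_spam])
>     # Incremental training then only updates word probabilities;
>     # zham and zspam keep using the frozen population stats.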
>