[Spambayes] Re: First result from Gary Robinson's ideas
Gary Robinson
grobinson@transpose.com
Wed, 18 Sep 2002 13:49:02 -0400
> The clearest difference:
> suppose Nigeria appears in 50 ham and 50 spam. You'll give it a spamprob
> approximately equal to 1/2 (fiddled a little by a and a*b). But I haven't
> given enough information there to say what Graham will give it! In his
> scheme, it also depends on the total number of hams and of spams. Say there
> are 100 spams and 200 hams. Then Graham effectively says "OK, it appears in
> half of the spams, but only a quarter of the hams, so if I see Nigeria it's
> twice as likely to be a spam, so the spamprob is really 2/3"
...
> I haven't tested this difference, but I *expect* this one can't be glossed
> over. Most people doing testing appear to have a lot more ham than spam to
> work with, and I expect that Graham's scheme of computing probs "as if" the
> numbers were equal (via comparing population ratios instead of raw counts)
> is really helping them.
Hmm. That's a good point. I'm pretty sure the key to it is to use the
Bayesian calcs, but TWICE.
That is:
First, look only at the spams. Let h represent the number of "hits", that
is, the number of spams containing the word "Nigeria", and let n be the
total number of spams.
Then the Bayesian probability of the next spam to be looked at containing
Nigeria is
a + h
-------
ab + n
where, again, 1/b is the default assumed probability in the absence of data
(the value the formula gives when h = n = 0) and ab is the weight of the
assumption. (Again, in strict terms this is a binomial probability based on a
beta distribution prior.)
If we let a=0, it's the same as Paul (ignoring the little tricks like
multiplying by 2); h is 50 and n is 100 and the result is .5.
Now do the same thing on the ham side, and you get .25.
Then do the same combining trick to get 2/3.
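To make the arithmetic above concrete, here's a minimal Python sketch (the function names and structure are mine, purely for illustration); with a=0 the smoothing vanishes and it reproduces the raw ratios from the example:

```python
# Per-class Bayesian estimate from the text: (a + h) / (a*b + n).
# With a = 0 the smoothing term vanishes and we get the raw hit ratio.
def class_prob(hits, total, a=0.0, b=2.0):
    return (a + hits) / (a * b + total)

# "Nigeria": in 50 of 100 spams, and in 50 of 200 hams.
p_spam = class_prob(50, 100)   # 0.5
p_ham = class_prob(50, 200)    # 0.25

# The combining trick: how spammy is the word overall?
spamprob = p_spam / (p_spam + p_ham)
print(spamprob)                # 2/3
```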
Now, the advantage of using the Bayesian trick of having a and b is for the
cases where a word hasn't been seen before or has only been seen rarely
before. In such cases, you don't really know enough about it to make
reliable extrapolations about the probabilities. So, a and b serve to handle
that uncertain situation. As I said in the original essay, Paul handles it
manually by inserting .4 in the case of no earlier data, but that does
nothing for the cases where there is little data. If I understand you
correctly, you use .5. That is equivalent to a=1 and b=2, but the Bayesian
calc is much more general.
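A quick sketch of that equivalence (again just illustrative): with a=1 and b=2, unseen and rarely seen words are pulled toward .5 instead of jumping straight to 0 or 1.

```python
# The (a + h) / (a*b + n) estimate with a = 1, b = 2, which is the
# Bayesian equivalent of using .5 as the unknown-word probability.
def smoothed_prob(hits, total, a=1.0, b=2.0):
    return (a + hits) / (a * b + total)

print(smoothed_prob(0, 0))     # never seen: 0.5, the default assumption
print(smoothed_prob(1, 1))     # seen once, always a hit: 2/3, not 1.0
print(smoothed_prob(50, 100))  # plenty of data: the raw ratio dominates
```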
This is all a little bit weird because it gives equal weight to the data
that comes from the spam and non-spam cases, for which there is no real
justification.
But I understand Tim's point in the following:
> Also, when it comes to deploying this, I doubt it
> will be possible to get people to keep training the system with a number of
> ham and spam in correct proportion to the real-life ratio; they'll tend to
> feed all their spam into their training data and skip adding any more ham;
> so I expect it will really help if the system *doesn't* assume that the
> ratio of spam to ham it's trained on is any sort of estimate of the
> real-life ratio.
So if we assume that everything has been classified, which is an assumption
behind my suggestion, we end up with the distortion Tim describes above.
In that way, Paul's trick seems to be more realistic.
HOWEVER, rather than give the two cases equal weight, I think we ought to
find out what weighting leads to the best performance and use that weight.
Equal weight seems like a good compromise during this development period,
and then we can tune it.
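Purely as an illustration of where such a tuning knob could sit (this parameterization is my own guess, not a tested proposal), a weight w on the ham-side evidence in the combining step reduces to the equal-weight compromise at w = 1:

```python
# Hypothetical weighted combining step; w = 1.0 is the equal-weight
# compromise, and the best w would have to be found empirically.
def combine(p_spam, p_ham, w=1.0):
    return p_spam / (p_spam + w * p_ham)

print(combine(0.5, 0.25))         # equal weight: 2/3
print(combine(0.5, 0.25, w=2.0))  # ham evidence counted double: 0.5
```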
That's my $.02 anyway! :)
--Gary
--
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454
> From: Tim Peters <tim.one@comcast.net>
> Date: Wed, 18 Sep 2002 11:51:17 -0400
> To: Gary Robinson <grobinson@transpose.com>
> Subject: RE: First result from Gary Robinson's ideas
>
>> Thanks for doing this testing, I really appreciate it!!!!
>
> Not as much as I appreciate your ideas <wink>.
>
>> Very interesting about how the spread could be used to look for things to
>> check manually.
>
> Not something I would have thought of; the people currently using
> SpamAssassin keep bringing this up. People running large mail hubs are
> willing to put *some* time into manual review, so it's great if a scheme has
> a useful and relatively small "middle ground".
>
>> Should I assume that, in general, I can post your emails to my
>> site and link to them from my essay when they help to illuminate the
>> issues?
>
> No problem. It would be better for both of us, though, if you replied to
> the spambayes mailing list. That's a small group of bright people, and
> they'll have helpful things to say too, and then neither of us (especially
> me <wink>) will have to spend more time repackaging and regurgitating what
> we say here.
>
>> One thing... I also received test results this morning from Raymond Chen,
>> who found that my technique works much better when S is .71 than
>> when it is .5. If I am reading his results correctly, on Raymond's corpus
>> mine had about 60% fewer classification errors than Graham's.
>
> We already differ from Graham's original scheme in many details, and each of
> a few dozen changes made significant improvement in at least one of the
> error rates. For example, special tagging of embedded URLs cut the false
> negative rate in half, and boosting the "unknown word" probability from 0.2
> to 0.5 cut it by another third (BTW, Graham's writeup *currently* says he
> uses 0.4 for this; when we started, it said 0.2). And we've removed all but
> one of Graham's deliberate "let's favor ham" biases (we set the unknown word
> prob to 0.5, and got rid of the gimmick that assigned the unknown-word prob
> to words for which 2*hamcount + spamcount isn't at least 5). If Chen hasn't
> yet been driven to similar changes, his baseline performance is almost
> certainly much worse than ours, and I bet his baseline score histograms are
> much fuzzier too.
>
> I expect it's also likely that he's using smaller data sets, and perhaps not
> running a testing procedure as careful as N-fold cross validation
> (especially for smaller datasets, I think that last is crucial: the
> variance across runs has sometimes been high in smaller-dataset tests; I've
> seen other projects adopt or reject changes based on a single run on a
> single corpus, and that's just a recipe for random thrashing).
>
>> Raymond doesn't explicitly say so, but it looks like he only implemented
>> the first part of my ideas.
>
> The second part turns out to have several differences from what Graham does,
> beyond just whether to count word duplications. The clearest difference:
> suppose Nigeria appears in 50 ham and 50 spam. You'll give it a spamprob
> approximately equal to 1/2 (fiddled a little by a and a*b). But I haven't
> given enough information there to say what Graham will give it! In his
> scheme, it also depends on the total number of hams and of spams. Say there
> are 100 spams and 200 hams. Then Graham effectively says "OK, it appears in
> half of the spams, but only a quarter of the hams, so if I see Nigeria it's
> twice as likely to be a spam, so the spamprob is really 2/3" (note that I'm
> ignoring the complication introduced here by his multiplying the hamcounts
> by 2 -- in this specific example, he'll actually give Nigeria a spamprob of
> 1/2 too, but only because he pretends Nigeria actually appeared in 2*50 =
> 100 spam, so that the fiddled ratios 50/100 and 2*50/200 are equal).
>
> I haven't tested this difference, but I *expect* this one can't be glossed
> over. Most people doing testing appear to have a lot more ham than spam to
> work with, and I expect that Graham's scheme of computing probs "as if" the
> numbers were equal (via comparing population ratios instead of raw counts)
> is really helping them. Also, when it comes to deploying this, I doubt it
> will be possible to get people to keep training the system with a number of
> ham and spam in correct proportion to the real-life ratio; they'll tend to
> feed all their spam into their training data and skip adding any more ham;
> so I expect it will really help if the system *doesn't* assume that the
> ratio of spam to ham it's trained on is any sort of estimate of the
> real-life ratio.
>