[Spambayes] Re: First result from Gary Robinson's ideas

Gary Robinson grobinson@transpose.com
Wed, 18 Sep 2002 14:23:36 -0400


OK, continuing to reply to myself... here's what I think is a decent
solution.

There are definitely advantages to your way of calculating the word
probabilities over the one in my essay, as Tim pointed out.

However, there is still reason to use a Bayesian approach LIKE the one I
pointed to: it handles the "unknown word" case (which Paul handles with .4
and you handle with .5) in a way that also deals with the "almost unknown
word" case of having only a very small amount of data for the word.

Let p(w) be the probability for the word as currently calculated (with NO
special handling for the "unknown word" case -- the formula below takes care
of that).
Let n be the number of emails that contain word w that went into the
calculation.

Then let

        a + (n * p(w))
f(w) = ---------------
        (a * b) + n

where b = 1/avg(p(w)) across all words (maybe a weighted average that factors
in the relative frequency of the words -- I'm not sure about this right now),
or b can simply be tuned by hand, and

a is 1 to start with but is tuned for performance.

Note that unknown words have n=0, so the probability is a/(a*b) = 1/b, which
may be .4, .5, or whatever works best. For example, we could choose a=1 and
b=2.5 or 2.0, to arrive at .4 or .5, respectively.
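
To make the arithmetic concrete, here is a rough Python sketch of f(w); the
function name and the defaults a=1, b=2 are just illustrative choices, not
anything taken from the existing code:

    def adjusted_word_prob(p_w, n, a=1.0, b=2.0):
        # f(w) = (a + n*p(w)) / (a*b + n).  p_w is the word's probability as
        # currently calculated; n is the number of emails containing the word
        # that went into that calculation.
        return (a + n * p_w) / (a * b + n)

    adjusted_word_prob(0.00, 0)      # unknown word: a/(a*b) = 0.5
    adjusted_word_prob(0.99, 2)      # barely-seen word: pulled in to ~0.745
    adjusted_word_prob(0.99, 1000)   # well-known word: stays at ~0.989

The more data there is for a word, the more f(w) is dominated by p(w); with
little or no data it falls back toward the assumed prior.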

Note that f(w) is not justified by viewing it as a binomial distribution
with a beta prior. Actually, I have a way of justifying it based on a
multinomial distribution with a Dirichlet prior that I won't go into here.
Maybe there are other ways of justifying it.

Rigorous justification from a Bayesian POV or not, I am willing to assert
that it makes sense in this context.

Unless I've made another dumb-ass error. ;)



--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


> From: Gary Robinson <grobinson@transpose.com>
> Date: Wed, 18 Sep 2002 14:02:15 -0400
> To: Tim Peters <tim.one@comcast.net>, SpamBayes <spambayes@python.org>
> Subject: Re: First result from Gary Robinson's ideas
> 
> Responding to my own post, sorry. I think I used the "unknown word" examples
> of .4 and .5 in a screwed-up way below so please ignore those comments.
> 
> Also, I think that a and b would be overwhelmed by the real-world data if the
> system is doing a lot of classifying, so there may really be no need to bother
> with them at all; they'd only be useful when only a very small number of
> emails had been classified.
> 
> Hmmmm. I want to think about this a bit more.
> 
> 
> 
> --Gary
> 
> 
> -- 
> Gary Robinson
> CEO
> Transpose, LLC
> grobinson@transpose.com
> 207-942-3463
> http://www.emergentmusic.com
> http://radio.weblogs.com/0101454
> 
> 
>> From: Gary Robinson <grobinson@transpose.com>
>> Date: Wed, 18 Sep 2002 13:49:02 -0400
>> To: Tim Peters <tim.one@comcast.net>, SpamBayes <spambayes@python.org>
>> Subject: Re: First result from Gary Robinson's ideas
>> 
>>> The clearest difference:
>>> suppose Nigeria appears in 50 ham and 50 spam.  You'll give it a spamprob
>>> approximately equal to 1/2 (fiddled a little by a and a*b).  But I haven't
>>> given enough information there to say what Graham will give it!  In his
>>> scheme, it also depends on the total number of hams and of spams.  Say there
>>> are 100 spams and 200 hams.  Then Graham effectively says "OK, it appears in
>>> half of the spams, but only a quarter of the hams, so if I see Nigeria it's
>>> twice as likely to be a spam, so the spamprob is really 2/3"
>> ...
>>> I haven't tested this difference, but I *expect* this one can't be glossed
>>> over.  Most people doing testing appear to have a lot more ham than spam to
>>> work with, and I expect that Graham's scheme of computing probs "as if" the
>>> numbers were equal (via comparing population ratios instead of raw counts)
>>> is really helping them.
>> 
>> Hmm. That's a good point. I'm pretty sure the key to it is to use the
>> Bayesian calcs, but TWICE.
>> 
>> That is:
>> 
>> First, look only at the spams. Let h represent the number of "hits", that is,
>> the number of spams containing the word "Nigeria", and let n be the total
>> number of spams.
>> 
>> Then the Bayesian probability of the next spam to be looked at containing
>> Nigeria is 
>> 
>> a + h
>> -------
>> ab + n
>> 
>> where, again, a/b is the default assumed probability in the absence of data
>> and a is the weight of the assumption. (Again, in strict terms this is a
>> binomial probability based on a beta distribution prior.)
>> 
>> If we let a=0, it's the same as Paul's (ignoring the little tricks like
>> multiplying by 2); h is 50 and n is 100, and the result is .5.
>> 
>> Now do the same thing on the ham side, and you get .25.
>> 
>> Then do the same combining trick to get 2/3.
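>> 
>> In case it helps, here's a rough Python sketch of that two-pass calculation
>> (the name side_prob and the defaults are just illustrative, not anything in
>> the codebase):
>> 
>>     def side_prob(hits, total, a=1.0, b=2.0):
>>         # Bayesian estimate for one side (spam or ham):
>>         # (a + hits) / (a*b + total).  Letting a=0 reduces it to the raw
>>         # ratio hits/total.
>>         return (a + hits) / (a * b + total)
>> 
>>     # Tim's Nigeria example, with a=0 so the prior drops out:
>>     p_spam = side_prob(50, 100, a=0)   # 0.5  -- half the spams
>>     p_ham  = side_prob(50, 200, a=0)   # 0.25 -- a quarter of the hams
>> 
>>     # The same combining trick:
>>     spamprob = p_spam / (p_spam + p_ham)   # 2/3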
>> 
>> Now, the advantage of using the Bayesian trick of having a and b is for the
>> cases where a word hasn't been seen before or has only been seen rarely
>> before. In such cases, you don't really know enough about it to make reliable
>> extrapolations about the probabilities. So, a and b serve to handle that
>> uncertain situation. As I said in the original essay, Paul handles it manually
>> by inserting .4 in the case of no earlier data, but that does nothing for the
>> cases where there is little data. If I understand you correctly, you use .5.
>> That is equivalent to a=1 and b=2, but the Bayesian calc is much more
>> general.
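>> 
>> Just to illustrate with the side_prob sketch above, a=1 and b=2 give exactly
>> that behavior:
>> 
>>     side_prob(0, 0)   # unknown word: a/(a*b) = 0.5
>>     side_prob(1, 1)   # seen once, in one spam: 2/3 rather than a hard 1.0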
>> 
>> This is all a little bit weird because it gives equal weight to the data
>> that comes from the spam and non-spam cases, for which there is no real
>> justification.
>> 
>> But I understand Tim's point in the following:
>> 
>>> Also, when it comes to deploying this, I doubt it
>>> will be possible to get people to keep training the system with a number of
>>> ham and spam in correct proportion to the real-life ratio; they'll tend to
>>> feed all their spam into their training data and skip adding any more ham;
>>> so I expect it will really help if the system *doesn't* assume that the
>>> ratio of spam to ham it's trained on is any sort of estimate of the
>>> real-life ratio.
>> 
>> So if we assume that everything has been classified, which is an assumption
>> behind my suggestion, we end up with the distortion Tim describes above.
>> 
>> In that way, Paul's trick seems to be more realistic.
>> 
>> HOWEVER, rather than give the two cases equal weight, I think we ought to
>> find out what weighting leads to the best performance and use that weight.
>> Using equal weight seems like a good compromise during this development
>> period, and then we can tune it.
>> 
>> That's my $.02 anyway! :)
>> 
>> 
>> --Gary
>> 
>> 
>> -- 
>> Gary Robinson
>> CEO
>> Transpose, LLC
>> grobinson@transpose.com
>> 207-942-3463
>> http://www.emergentmusic.com
>> http://radio.weblogs.com/0101454
>> 
>> 
>>> From: Tim Peters <tim.one@comcast.net>
>>> Date: Wed, 18 Sep 2002 11:51:17 -0400
>>> To: Gary Robinson <grobinson@transpose.com>
>>> Subject: RE: First result from Gary Robinson's ideas
>>> 
>>>> Thanks for doing this testing, I really appreciate it!!!!
>>> 
>>> Not as much as I appreciate your ideas <wink>.
>>> 
>>>> Very interesting about how the spread could be used to look for things to
>>>> check manually.
>>> 
>>> Not something I would have thought of; the people currently using
>>> SpamAssassin keep bringing this up.  People running large mail hubs are
>>> willing to put *some* time into manual review, so it's great if a scheme has
>>> a useful and relatively small "middle ground".
>>> 
>>>> Should I assume that, in general, I can post your emails to my
>>>> site and link to them from my essay when they help to illuminate the
>>>> issues?
>>> 
>>> No problem.  It would be better for both of us, though, if you replied to
>>> the spambayes mailing list.  That's a small group of bright people, and
>>> they'll have helpful things to say too, and then neither of us (especially
>>> me <wink>) will have to spend more time repackaging and regurgitating what
>>> we say here.
>>> 
>>>> One thing... I also received test results this morning from Raymond Chen,
>>>> who found that my technique works much better when S is .71 than
>>>> when it is .5. If I am reading his results correctly, on Raymond's corpus
>>>> mine had about 60% fewer classification errors than Graham's.
>>> 
>>> We already differ from Graham's original scheme in many details, and each of
>>> a few dozen changes made significant improvement in at least one of the
>>> error rates.  For example, special tagging of embedded URLs cut the false
>>> negative rate in half, and boosting the "unknown word" probability from 0.2
>>> to 0.5 cut it by another third (BTW, Graham's writeup *currently* says he
>>> uses 0.4 for this; when we started, it said 0.2).  And we've removed all but
>>> one of Graham's deliberate "let's favor ham" biases (we set the unknown word
>>> prob to 0.5, and got rid of the gimmick that assigned the unknown-word prob
>>> to words for which 2*hamcount + spamcount isn't at least 5).  If Chen hasn't
>>> yet been driven to similar changes, his baseline performance is almost
>>> certainly much worse than ours, and I bet his baseline score histograms are
>>> much fuzzier too.
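>>> 
>>> (For concreteness, the 2*hamcount + spamcount gimmick mentioned above
>>> amounted to something like this -- a sketch, not the exact code:
>>> 
>>>     if 2 * hamcount + spamcount < 5:
>>>         spamprob = unknown_word_prob   # 0.2, later 0.4, in Graham's writeup
>>> 
>>> i.e. words without enough data were simply given the unknown-word value.)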
>>> 
>>> I expect it's also likely that he's using smaller data sets, and perhaps not
>>> running a testing procedure as careful as N-fold cross validation
>>> (especially for smaller datasets, I think that last is crucial:  the
>>> variance across runs has sometimes been high in smaller-dataset tests;  I've
>>> seen other projects adopt or reject changes based on a single run on a
>>> single corpus, and that's just a recipe for random thrashing).
>>> 
>>>> Raymond doesn't explicitly say so, but it looks like he only implemented the
>>>> first part of my ideas.
>>> 
>>> The second part turns out to have several differences from what Graham does,
>>> beyond just whether to count word duplications.  The clearest difference:
>>> suppose Nigeria appears in 50 ham and 50 spam.  You'll give it a spamprob
>>> approximately equal to 1/2 (fiddled a little by a and a*b).  But I haven't
>>> given enough information there to say what Graham will give it!  In his
>>> scheme, it also depends on the total number of hams and of spams.  Say there
>>> are 100 spams and 200 hams.  Then Graham effectively says "OK, it appears in
>>> half of the spams, but only a quarter of the hams, so if I see Nigeria it's
>>> twice as likely to be a spam, so the spamprob is really 2/3" (note that I'm
>>> ignoring the complication introduced here by his multiplying the hamcounts
>>> by 2 -- in this specific example, he'll actually give Nigeria a spamprob of
>>> 1/2 too, but only because he pretends Nigeria actually appeared in 2*50 =
>>> 100 ham, so that the fiddled ratios 50/100 and 2*50/200 are equal).
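>>> 
>>> In code form, the ratio-based calculation being described is roughly this
>>> (a sketch that ignores Graham's min/max clamping and other details):
>>> 
>>>     def graham_style_spamprob(spamcount, hamcount, nspam, nham, ham_bias=1):
>>>         # Compare per-population ratios rather than raw counts; Graham
>>>         # also multiplies the hamcount by 2 (ham_bias=2).
>>>         spam_ratio = spamcount / nspam
>>>         ham_ratio = ham_bias * hamcount / nham
>>>         return spam_ratio / (spam_ratio + ham_ratio)
>>> 
>>>     graham_style_spamprob(50, 50, 100, 200)              # 2/3 without the bias
>>>     graham_style_spamprob(50, 50, 100, 200, ham_bias=2)  # 1/2 with the *2 gimmick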
>>> 
>>> I haven't tested this difference, but I *expect* this one can't be glossed
>>> over.  Most people doing testing appear to have a lot more ham than spam to
>>> work with, and I expect that Graham's scheme of computing probs "as if" the
>>> numbers were equal (via comparing population ratios instead of raw counts)
>>> is really helping them.  Also, when it comes to deploying this, I doubt it
>>> will be possible to get people to keep training the system with a number of
>>> ham and spam in correct proportion to the real-life ratio; they'll tend to
>>> feed all their spam into their training data and skip adding any more ham;
>>> so I expect it will really help if the system *doesn't* assume that the
>>> ratio of spam to ham it's trained on is any sort of estimate of the
>>> real-life ratio.
>>>