[Spambayes] statistical comparison of environment?

Rob W. W. Hooft rob at hooft.net
Thu Mar 6 11:13:04 EST 2003

Skip Montanaro wrote:
>  Suppose improvement A
> takes that to 83% and applied independently to the base system, improvement
> B takes that to 85%.  How do you tell how independent A and B are from one
> another?

Apart from all the good suggestions already made on this, I would say that a little information entropy would do wonders.

Say we have one token that occurs in 25 out of 100 messages, regardless of whether they are ham or spam, and another token that also occurs in 25 of the same 100 messages.

            present   absent
token1         25       75
token2         25       75

In this case, both tokens have an information entropy (S) of:

S = 0.25*log_e(1/0.25) + 0.75*log_e(1/0.75) = 0.56 nat
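
As a quick check of the arithmetic, here is a minimal Python sketch of the same calculation. The function name `entropy` and its `base` parameter are illustrative choices, not anything taken from the Spambayes code. With the natural logarithm used in the formula above the result comes out in nats; base 2 gives the same quantity in bits.

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy of a discrete distribution.

    Terms with zero probability are skipped, since they contribute nothing.
    """
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

# One token present in 25 of 100 messages.
p_present = 25 / 100
s_nat = entropy([p_present, 1 - p_present])          # natural log: nats
s_bit = entropy([p_present, 1 - p_present], base=2)  # log base 2: bits

print(f"{s_nat:.2f} nat, {s_bit:.2f} bit")  # prints "0.56 nat, 0.81 bit"
```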

Combining the two tokens can give different possibilities, among which:

                 token1
token2     present   absent
present       25        0         S = 0.56 nat
absent         0       75

                 token1
token2     present   absent
present        9       16         S = 1.11 nat
absent        16       59

                 token1
token2     present   absent
present        0       25         S = 1.04 nat
absent        25       50

This way it is possible to see how many "bits" of information are obtained from one token individually, or from a combination of tokens. In general, combining tokens gives no more than the sum of their individual contributions, and less whenever the tokens are not independent. How much less is a quantitative measure of the correlation between the tokens: in the first combination above the second token adds nothing, whereas in the second the total is close to the sum of the two. Of course this says nothing about how suitable each token is for characterizing a message as spam. Someone with a better background in information theory can probably combine the information entropy with the suitability in a proper way. In any case, if the two tokens under study are correlated as in the first combination (25/0/0/75), they are equally suited for spam classification.
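
The "how much less" comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `entropy`, the dictionary `tables`, and its labels are hypothetical names, the counts are the three 2x2 combinations given above, and natural logarithms are used as in the formula for S. The shortfall computed here, S(token1) + S(token2) minus the joint S, is what information theory calls the mutual information of the two tokens.

```python
import math

def entropy(counts):
    """Shannon entropy in nats of a distribution given as raw counts."""
    total = sum(counts)
    return sum((c / total) * math.log(total / c) for c in counts if c > 0)

# Joint counts per table, flattened row by row:
# (both present, token2 only, token1 only, neither).
# The labels are illustrative, not from the original post.
tables = {
    "identical": [25, 0, 0, 75],
    "nearly independent": [9, 16, 16, 59],
    "mutually exclusive": [0, 25, 25, 50],
}

# Each token alone is present in 25 and absent in 75 of 100 messages.
s_single = entropy([25, 75])

results = {}
for name, joint in tables.items():
    s_joint = entropy(joint)
    # Information shared by the two tokens (mutual information).
    shortfall = 2 * s_single - s_joint
    results[name] = (s_joint, shortfall)
    print(f"{name:20s} S = {s_joint:.2f} nat, shortfall = {shortfall:.2f} nat")
```

With these counts the joint entropies should come out at roughly 0.56, 1.11 and 1.04 nat, with shortfalls of roughly 0.56, 0.01 and 0.08 nat. A shortfall near zero means the second token carries almost entirely new information; a shortfall equal to the single-token entropy means it carries none.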

Regards,

Rob

--
Rob W.W. Hooft  ||  rob at hooft.net  ||  http://www.hooft.net/people/rob/