[Spambayes] More on Training Disparity Issues

Sat Jul 17 05:21:08 CEST 2004

Okay, with reference to my just-sent message about SpamBayes POP3 Proxy
version 1.0rc2:

Wow!  I just answered one of my own questions:

My posted question:  I don't think the message header information
contributes to the spam score.  I don't see it showing up in the message
clues.  But, just to be sure, here's my question:  Would >identical<
mail sent to rich at rbarger.com, rich at cornerbarpr.com,
info at cornerbarpr.com, support at cornerbarpr.com, rich at swbell.net,
rich at sprintmail.com, etc., be scored the same?  Or would the different
addressees and other minor differences in the headers cause scoring
differences?

I just sent an identical semi-spammy message to three different
accounts.  Each one scored differently -- dramatically so:

To Rich at RBarger.com:  X-Spambayes-Spam-Probability: 0.08321

To Rich at CornerBarPR.com:  X-Spambayes-Spam-Probability: 0.26562

To Info at CornerBarPR.com:  X-Spambayes-Spam-Probability: 0.37489

Am I ever surprised!

Why is this?
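
One rough way to look into it, assuming the classifier draws tokens from
the header lines as well as the body (I haven't confirmed that from the
SpamBayes source): copies that differ only in the To: line would then
produce different token sets. The header_tokens helper and the addresses
below are made up for illustration, and the helper uses only Python's
standard email module. It is not the SpamBayes tokenizer.

```python
from email import message_from_string


def header_tokens(raw):
    """Hypothetical tokenizer: one 'name:word' token per header word."""
    msg = message_from_string(raw)
    tokens = set()
    for name, value in msg.items():
        for word in value.lower().split():
            tokens.add(f"{name.lower()}:{word}")
    return tokens


# Same subject and body; only the To: line differs (illustrative addresses).
REST = "Subject: Great offer\n\nBuy now, limited time.\n"
copy_a = "To: rich@rbarger.example\n" + REST
copy_b = "To: info@cornerbarpr.example\n" + REST

# Tokens that appear in one copy but not the other.
only_a = header_tokens(copy_a) - header_tokens(copy_b)
only_b = header_tokens(copy_b) - header_tokens(copy_a)
print(sorted(only_a))
print(sorted(only_b))
```

If tokens like these carry different weights in the training data, the
same body text could end up with different scores per account.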

At any rate, it tends to confirm my observation that, in the same
message stream, the CornerBarPR.com folder seems to have a greater
proportion of unsures than the RBarger.com folder.

And I partially answered another question:

It appears that using the SpamBayes Web Interface to classify a message
gives dramatically different scores when all headers are included than
when only the abridged or normal headers are visible.

Why is this?  That's certainly not the behavior I expected.

If I'm correct, the implication is that training would be vastly
different depending on whether the user displays full headers or not.
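
To make that concrete, here is a small sketch of how much header
material drops out of view in a "normal" display. The VISIBLE set is my
guess at what a mail client shows by default, the split_headers helper
is hypothetical, and the sample message is invented. None of it comes
from SpamBayes itself.

```python
from email import message_from_string

# Assumed "normal" header view; real mail clients differ.
VISIBLE = {"from", "to", "subject", "date"}


def split_headers(raw):
    """Separate headers shown in an abridged view from those hidden."""
    msg = message_from_string(raw)
    shown = [(n, v) for n, v in msg.items() if n.lower() in VISIBLE]
    hidden = [(n, v) for n, v in msg.items() if n.lower() not in VISIBLE]
    return shown, hidden


# Invented sample message with full headers.
full = (
    "Received: from mail.example by mx.example; Sat, 17 Jul 2004 05:00:00 +0200\n"
    "Message-ID: <1234@mail.example>\n"
    "From: sender@example.org\n"
    "To: rich@rbarger.example\n"
    "Subject: Great offer\n"
    "\n"
    "Buy now.\n"
)

shown, hidden = split_headers(full)
print("shown: ", [name for name, _ in shown])
print("hidden:", [name for name, _ in hidden])
```

Anything in the hidden list would be missing from text pasted out of an
abridged view, so the classifier would be working from less evidence.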

True?

So, which do you recommend users train on:  Full headers or normal?

Sheesh.  The more I find out, the more confused I become.

Thanks.

Rich Barger
Kansas City