[Spambayes] More on Training Disparity Issues

Sat Jul 17 16:45:38 CEST 2004

Thank you very much, Adam.  I'm not familiar with Python coding or what goes
on behind the scenes in SpamBayes, so your explanation is particularly
helpful.

Of course, training on raw message source involves yet more steps for each
message I want to train, but your clear explanation shows that it likely
would be valuable -- perhaps extremely so.

Actually, now that I understand better, I see a likely reason for the
disparity in scoring when the same message was sent to different accounts:

I usually manually process the RBarger.com account first.  If I've manually
trained on a ham message sent there, I am much less likely to train on the
same (or a very similar) ham message that appears in the CornerBarPR.com
account -- and I get quite a few of those.  Further, when I force-train ham
to keep up the ham-spam balance, I most frequently use ham from the
RBarger.com folder, not from the CornerBarPR.com folder.

So, I've obviously been weighting the ham messages toward RBarger.com and
against CornerBarPR.com.

In addition, I get somewhat more -- and different types of -- spam on the
CornerBarPR.com account; so, a given message in CornerBarPR.com is more
likely to be trained as spam than a given message in RBarger.com.

Based on what you explained, it appears that I am unintentionally telling
SpamBayes that, before the message body is even considered, if a message
comes to RBarger.com, it is somewhat more likely to be ham, and, if a message
comes to CornerBarPR.com, it is slightly more likely to be spam.
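
To make that effect concrete, here is a minimal Python sketch of how a recipient-address token could pick up a leaning from lopsided training. It assumes a Robinson-style smoothed estimate of the kind SpamBayes is generally described as using; the function name, the smoothing constants, and every count below are hypothetical and chosen only for illustration, not taken from my actual database.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed estimate that a message containing this token is spam.

    s (strength of the prior) and x (probability assumed for an unseen
    token) are assumed values, not read from any SpamBayes config.
    """
    n = spam_count + ham_count
    if n == 0 or nspam == 0 or nham == 0:
        return x  # nothing known about the token: fall back to the prior
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Blend the observed ratio with the prior, weighted by evidence count.
    return (s * x + n * p) / (s + n)


# Hypothetical training totals: 200 ham and 200 spam messages overall.
NHAM, NSPAM = 200, 200

# Hypothetical counts of trained messages addressed to each account.
trained = {
    "RBarger.com": {"ham": 150, "spam": 60},
    "CornerBarPR.com": {"ham": 50, "spam": 140},
}

probs = {
    account: token_spam_prob(c["spam"], c["ham"], NSPAM, NHAM)
    for account, c in trained.items()
}

for account, prob in probs.items():
    leaning = "spam" if prob > 0.5 else "ham"
    print(f"{account}: {prob:.3f} (leans {leaning})")
```

With counts skewed like these, the address itself becomes a clue before any body text is looked at, which matches the disparity I saw.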

What a fascinating revelation.

Thank you, good sir.

BTW, please don't let Mr. Walker's helpful msg discourage other gurus from
answering the other questions -- spam cutoff scores, misclassification,
spammer ploys, and other issues -- I posed last evening.

Thanks again.

Rich Barger
Kansas City

---

Adam Walker wrote:

> On Jul 16, 2004, at 11:21 PM, Richard B Barger ABC APR wrote:
> > My posted question:  I don't think the message header information
> > contributes to the spam score.
>
> There are many options that control what kind of tokens are generated
> from headers.
> By default, only the "safe_headers" are tokenized.
> The following is from Options.py; it shows which headers are considered
> by default:
>
>     ("safe_headers", "Safe headers", ("abuse-reports-to", "date", "errors-to",
>                                        "from", "importance", "in-reply-to",
>                                        "message-id", "mime-version",
>                                        "organization", "received",
>                                        "reply-to", "return-path", "subject",
>                                        "to", "user-agent", "x-abuse-info",
>                                        "x-complaints-to", "x-face"),
>       """Like count_all_header_lines, but restricted to headers in this list.
>       safe_headers is ignored when count_all_header_lines is true, unless
>       record_header_absence is also true.""",
>       HEADER_NAME, RESTORE),
>
> <snip section where Mr. Barger describes same email sent to different
> accounts>
>
> > Am I ever surprised!
> >
> > Why is this?
> >
>
> Notice that "to" and a few other headers are considered "safe" headers.
> If an email account gets a lot of trained spam addressed *to that
> address*, then the address becomes a spammy token, the same as any word.
>
> > It appears that using the SpamBayes Web Interface to classify a message
> > gives dramatically different scores if all headers are included, versus
> > if abridged or normal headers are all that's visible.
> >
> > Why is this?  That's certainly not the behavior I expected.
> >
>
> The options mentioned above influence this behavior.
>
> > If I'm correct, the implication is training would be vastly different,
> > depending on whether the user displays full headers or not.
>
> > True?
> >
> > So, which do you recommend users train on:  Full headers or normal?
> >
>
> You should train/classify using the raw message source, not just a view
> of the headers and a view of the message. If you did not train using raw
> messages, then I highly recommend starting over. It may solve some of
> your other problems.
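
The difference between raw source and an abridged view can be illustrated with a short Python sketch. This is not the SpamBayes tokenizer: `header_tokens`, the `name:word` token format, the trimmed `SAFE_HEADERS` subset, and both sample messages are assumptions made up for the example, and only the standard-library `email` module is used.

```python
from email import message_from_string

# Subset of the "safe" header names quoted above, for illustration only.
SAFE_HEADERS = {"from", "to", "subject", "received",
                "reply-to", "return-path", "message-id", "date"}


def header_tokens(source):
    """Return a set of hypothetical 'name:word' tokens from safe headers."""
    msg = message_from_string(source)
    tokens = set()
    for name, value in msg.items():
        name = name.lower()
        if name in SAFE_HEADERS:
            for word in str(value).lower().split():
                tokens.add(f"{name}:{word}")
    return tokens


# Made-up raw message source, with the full header block.
RAW = """\
Return-Path: <sender@example.com>
Received: from mail.example.com by mx.example.net; Fri, 16 Jul 2004 23:21:00 -0500
From: sender@example.com
To: someone@example.net
Subject: Lunch on Friday
Message-ID: <1234@example.com>

See you at noon.
"""

# The same message as an abridged view might show it.
ABRIDGED = """\
From: sender@example.com
Subject: Lunch on Friday

See you at noon.
"""

raw_tokens = header_tokens(RAW)
abridged_tokens = header_tokens(ABRIDGED)
only_in_raw = sorted(raw_tokens - abridged_tokens)

print(f"{len(raw_tokens)} tokens from raw source, "
      f"{len(abridged_tokens)} from the abridged view")
print("Seen only in raw source:", only_in_raw)
```

Every token that appears only in the raw source is evidence the classifier never sees when training or scoring from an abridged view, which is one plausible reason the two give different scores.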