[Spambayes] More on Training Disparity Issues
Adam Walker
adam.walker at rbwconsulting.com
Sat Jul 17 08:30:13 CEST 2004
On Jul 16, 2004, at 11:21 PM, Richard B Barger ABC APR wrote:
> My posted question: I don't think the message header information
> contributes to the spam score.
There are many options that control what kind of tokens are generated
from headers.
By default, only the "safe_headers" are tokenized.
The following is from Options.py; it shows which headers are considered
by default:
    ("safe_headers", "Safe headers",
     ("abuse-reports-to", "date", "errors-to", "from", "importance",
      "in-reply-to", "message-id", "mime-version", "organization",
      "received", "reply-to", "return-path", "subject", "to",
      "user-agent", "x-abuse-info", "x-complaints-to", "x-face"),
     """Like count_all_header_lines, but restricted to headers in this
     list.  safe_headers is ignored when count_all_header_lines is true,
     unless record_header_absence is also true.""",
     HEADER_NAME, RESTORE),
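As a rough illustration of what that docstring describes, here is a
minimal sketch of counting header lines restricted to a safe list. This
is not the SpamBayes tokenizer itself: the function name
count_safe_header_lines, the shortened SAFE_HEADERS set, and the
"header:name:count" token format are all assumptions made for the
example.

```python
from email import message_from_string

# Shortened, illustrative subset of the safe_headers list (an assumption,
# not the full default).
SAFE_HEADERS = {"date", "from", "message-id", "received", "subject", "to"}

def count_safe_header_lines(raw_text, safe_headers=SAFE_HEADERS):
    """Return one token per safe header name, carrying its line count.

    Hypothetical helper: the token format is made up for illustration.
    """
    msg = message_from_string(raw_text)
    counts = {}
    # Message.keys() lists every header line, duplicates included.
    for name in msg.keys():
        name = name.lower()
        if name in safe_headers:
            counts[name] = counts.get(name, 0) + 1
    return ["header:%s:%d" % (name, n) for name, n in sorted(counts.items())]

raw = (
    "Received: from a.example by b.example\n"
    "Received: from c.example by a.example\n"
    "X-Mailer: SomeMailer 1.0\n"
    "To: someone@example.com\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
tokens = count_safe_header_lines(raw)
print(tokens)
```

Note that X-Mailer produces no token here because it is not in the safe
list, while the two Received lines are counted together.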
<snip section where Mr. Barger describes same email sent to different
accounts>
> Am I ever surprised!
>
> Why is this?
>
Notice that "to" and a few other headers are considered "safe" headers.
If an email account gets a lot of trained spam addressed *to that
address*, then the address becomes a spammy token, the same as any word.
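To make that concrete, here is a toy sketch of how a token's spam
estimate follows the training counts. The simple frequency ratio below is
an assumption for illustration only; SpamBayes's real scoring is more
involved, and the function name and all the counts are made up.

```python
def naive_spam_ratio(spam_count, ham_count, nspam, nham):
    """Toy estimate of how spam-leaning a token looks.

    spam_count / ham_count: trained messages of each kind containing the token.
    nspam / nham: total trained messages of each kind.
    """
    spam_freq = spam_count / nspam if nspam else 0.0
    ham_freq = ham_count / nham if nham else 0.0
    total = spam_freq + ham_freq
    if total == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / total

# Hypothetical account whose address shows up in the To: header of
# 180 of 200 trained spam but only 20 of 200 trained ham.
busy = naive_spam_ratio(180, 20, 200, 200)

# Hypothetical account that rarely receives spam.
quiet = naive_spam_ratio(5, 150, 200, 200)

print("busy account:  %.2f" % busy)
print("quiet account: %.2f" % quiet)
```

The same message sent to both accounts would carry a different "to"
token in each case, so the two classifiers can reasonably disagree.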
> It appears that using the SpamBayes Web Interface to classify a message
> gives dramatically different scores if all headers are included, versus
> if abridged or normal headers are all that's visible.
>
> Why is this? That's certainly not the behavior I expected.
>
The options mentioned above influence this behavior.
> If I'm correct, the implication is training would be vastly different,
> depending on whether the user displays full headers or not.
> True?
>
> So, which do you recommend users train on: Full headers or normal?
>
You should train and classify using the raw message source, not just a
view of the headers and a view of the message. If you did not train
using raw messages, then I highly recommend starting over. It may solve
some of your other problems.
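A small sketch of why this matters, using only Python's standard email
module: the same message exposes a different set of headers depending on
whether you feed in the raw source or an abridged view. The sample
message and the header_names helper are hypothetical.

```python
from email import message_from_string

def header_names(text):
    """Return the sorted, lower-cased set of header names in a message."""
    return sorted({name.lower() for name in message_from_string(text).keys()})

# Raw source, as delivered (hypothetical example message).
raw_source = (
    "Return-Path: <sender@example.com>\n"
    "Received: from mail.example.com by mx.example.org\n"
    "Message-ID: <123@example.com>\n"
    "From: sender@example.com\n"
    "To: you@example.org\n"
    "Subject: Test\n"
    "\n"
    "Hello.\n"
)

# What a mail client's "normal headers" view might show.
abridged_view = (
    "From: sender@example.com\n"
    "To: you@example.org\n"
    "Subject: Test\n"
    "\n"
    "Hello.\n"
)

# Headers that never reach the classifier if you paste the abridged view.
missing = sorted(set(header_names(raw_source)) - set(header_names(abridged_view)))
print(missing)
```

Anything in that missing list is evidence the classifier saw during one
kind of training but not the other, which is why mixing the two gives
inconsistent scores.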