[Spambayes] More on Training Disparity Issues
adam.walker at rbwconsulting.com
Sat Jul 17 08:30:13 CEST 2004
On Jul 16, 2004, at 11:21 PM, Richard B Barger ABC APR wrote:
> My posted question: I don't think the message header information
> contributes to the spam score.
There are many options that control what kind of tokens are generated.
By default, only the "safe-headers" are tokenized.
The following is from Options.py; it shows which headers are considered safe:
("safe_headers", "Safe headers", ("abuse-reports-to", "date",
"""Like count_all_header_lines, but restricted to headers in this
safe_headers is ignored when count_all_header_lines is true, unless
record_header_absence is also true.""",
<snip section where Mr. Barger describes same email sent to different
> Am I ever surprised!
> Why is this?
Notice that "to" and a few other headers are considered "safe" headers.
If an email account gets a lot of trained spam addressed *to that
address*, then the address becomes a spammy token, the same as any word.
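
As a rough illustration of how an address can end up looking spammy, here is a
minimal Python sketch. It is not SpamBayes's tokenizer or its scoring formula:
SAFE_HEADERS, header_tokens() and train() are hypothetical names invented for
this example, and the ratio at the end is a deliberately naive stand-in for the
real probability calculation.

```python
from collections import Counter
from email import message_from_string

# Hypothetical subset of headers to tokenize; NOT SpamBayes's actual list.
SAFE_HEADERS = ("to", "date")

def header_tokens(raw):
    """Yield 'name:word' tokens for the chosen headers of a raw message."""
    msg = message_from_string(raw)
    for name in SAFE_HEADERS:
        # get_all returns the fallback (an empty list) if the header is absent.
        for value in msg.get_all(name, []):
            for word in value.lower().split():
                yield f"{name}:{word}"

def train(messages):
    """Count in how many messages each header token occurs."""
    counts = Counter()
    for raw in messages:
        counts.update(set(header_tokens(raw)))
    return counts

# Made-up training data: two spam messages to one address, one ham to another.
spam = ["To: sales@example.com\n\nBuy now", "To: sales@example.com\n\nCheap pills"]
ham = ["To: me@example.com\n\nLunch?"]

spam_counts = train(spam)
ham_counts = train(ham)

token = "to:sales@example.com"
# Naive ratio of spam occurrences to all occurrences of the token.
spamminess = spam_counts[token] / (spam_counts[token] + ham_counts[token])
```

With this toy data the address token has only ever been seen in spam, so any
new message sent to that address starts out with a strong spam clue.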
> It appears that using the SpamBayes Web Interface to classify a message
> gives dramatically different scores if all headers are included, versus
> if abridged or normal headers are all that's visible.
> Why is this? That's certainly not the behavior I expected.
The options mentioned above influence this behavior.
> If I'm correct, the implication is training would be vastly different,
> depending on whether the user displays full headers or not.
> So, which do you recommend users train on: Full headers or normal?
You should train/classify using the raw message source, not just a view
of the headers and a view of the message. If you did not train using raw
messages, then I highly recommend starting over. It may solve some of
your other problems.
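
To see why the displayed view and the raw source are not interchangeable, a
small sketch using Python's standard email module may help. The two messages
below are made up, and header_names() is a hypothetical helper for this
example, not part of SpamBayes. It only shows that a "normal headers" view
carries fewer headers, and so fewer possible header tokens, than the raw source.

```python
from email import message_from_string

# Made-up raw source, as the mail server delivered it.
raw_source = (
    "Received: from mail.example.net by mx.example.com\n"
    "Message-ID: <123@example.net>\n"
    "To: me@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text\n"
)

# What a mail client might show with "normal" headers (an assumption).
abridged_view = "To: me@example.com\nSubject: Hello\n\nBody text\n"

def header_names(text):
    """Return the set of lower-cased header names present in a message."""
    return {name.lower() for name in message_from_string(text).keys()}

# Headers the classifier never sees if you paste in the abridged view.
missing = header_names(raw_source) - header_names(abridged_view)
```

Anything in that missing set is evidence that exists at delivery time but is
absent from the pasted view, which is one way the two scores can diverge.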