[Spambayes] More on Training Disparity Issues

Adam Walker adam.walker at rbwconsulting.com
Sat Jul 17 08:30:13 CEST 2004


On Jul 16, 2004, at 11:21 PM, Richard B Barger ABC APR wrote:
> My posted question:  I don't think the message header information
> contributes to the spam score.

There are many options that control what kind of tokens are generated 
from headers.
By default, only the "safe_headers" are tokenized.
The following is from Options.py; it shows which headers are considered 
by default:

    ("safe_headers", "Safe headers", ("abuse-reports-to", "date", "errors-to",
                                       "from", "importance", "in-reply-to",
                                       "message-id", "mime-version",
                                       "organization", "received",
                                       "reply-to", "return-path", "subject",
                                       "to", "user-agent", "x-abuse-info",
                                       "x-complaints-to", "x-face"),
      """Like count_all_header_lines, but restricted to headers in this list.
      safe_headers is ignored when count_all_header_lines is true, unless
      record_header_absence is also true.""",
      HEADER_NAME, RESTORE),
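
Going by the option's description, here is a minimal sketch of what "restricted to headers in this list" could look like. It is not the SpamBayes tokenizer: the function name, the shortened header set, and the "header:name:count" token format are all assumptions made for illustration, and only the standard-library email module is used.

```python
import email
from collections import Counter

# Shortened, illustrative subset of the default list quoted above.
SAFE_HEADERS = {"date", "from", "received", "subject", "to"}

def count_safe_header_lines(raw_message, safe_headers=SAFE_HEADERS):
    """Return one token per safe header name, carrying its line count.

    Hypothetical helper: the token format is an assumption, not
    something taken from the SpamBayes source.
    """
    msg = email.message_from_string(raw_message)
    counts = Counter(
        name.lower() for name in msg.keys() if name.lower() in safe_headers
    )
    return sorted(f"header:{name}:{n}" for name, n in counts.items())

raw = (
    "Received: from a.example.net by b.example.org\n"
    "Received: from c.example.net by a.example.net\n"
    "To: someone@example.com\n"
    "X-Mailer: BulkSender 1.0\n"
    "Subject: Cheap offer\n"
    "\n"
    "Body text\n"
)
header_line_tokens = count_safe_header_lines(raw)
```

In this sketch X-Mailer produces no token because it is not in the set, while the two Received lines collapse into a single token with a count of 2.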

<snip section where Mr. Barger describes same email sent to different 
accounts>

> Am I ever surprised!
>
> Why is this?
>

Notice that "to" and a few other headers are considered "safe" headers. 
If an email account gets a lot of trained spam addressed *to that 
address*, then the address becomes a spammy token, the same as any word.
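
To make that concrete, here is a toy calculation. It uses a plain frequency ratio, which is a deliberate simplification and not the scoring SpamBayes actually performs; the function name, the addresses, and the training sets are invented for illustration.

```python
def token_spam_ratio(token, spam_msgs, ham_msgs):
    """Fraction of a token's (rate-normalised) appearances that are in spam.

    Each message is modelled as a set of tokens. Simplified stand-in
    for a real per-token spam probability.
    """
    spam_rate = sum(token in m for m in spam_msgs) / len(spam_msgs)
    ham_rate = sum(token in m for m in ham_msgs) / len(ham_msgs)
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen in training: no evidence either way
    return spam_rate / (spam_rate + ham_rate)

# Hypothetical training data: one address receives nearly all the spam.
spam = [
    {"to:sales@example.com", "cheap"},
    {"to:sales@example.com", "offer"},
    {"to:sales@example.com", "pills"},
]
ham = [
    {"to:sales@example.com", "invoice"},
    {"to:me@example.com", "lunch"},
    {"to:me@example.com", "meeting"},
    {"to:me@example.com", "report"},
]

sales_ratio = token_spam_ratio("to:sales@example.com", spam, ham)
me_ratio = token_spam_ratio("to:me@example.com", spam, ham)
```

With this data the sales address comes out at 0.8 and the personal address at 0.0, so the same message body would lean differently depending on which account received it.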

> It appears that using the SpamBayes Web Interface to classify a message
> gives dramatically different scores if all headers are included, versus
> if abridged or normal headers are all that's visible.
>
> Why is this?  That's certainly not the behavior I expected.
>

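The options mentioned above influence this behavior. Headers such as 
"received" and "return-path" are on the safe list, so they can only 
contribute if they are present in the text you submit.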

> If I'm correct, the implication is training would be vastly different,
> depending on whether the user displays full headers or not.

> True?
>
> So, which do you recommend users train on:  Full headers or normal?
>


You should train/classify using the raw message source, not just a view 
of the headers and a view of the message. If you did not train using raw 
messages, then I highly recommend starting over. It may solve some of 
your other problems.
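
As a rough illustration of why the raw source matters, the sketch below compares the header names found in a raw message with those in an abridged view of the same message. Both messages and the helper name are made up for the example; it uses only the standard-library email module and says nothing about how any particular mail client abridges headers.

```python
import email

# Hypothetical raw source, as stored by the mail server.
raw_source = (
    "Received: from relay.example.net by mx.example.org\n"
    "Return-Path: <bounce@example.net>\n"
    "From: sender@example.net\n"
    "To: someone@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text\n"
)

# Hypothetical abridged view, as a mail client might display it.
abridged_view = (
    "From: sender@example.net\n"
    "To: someone@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text\n"
)

def header_names(text):
    """Return the set of lower-cased header names present in a message."""
    return {name.lower() for name in email.message_from_string(text).keys()}

# Headers a classifier never sees when given only the abridged view.
missing = header_names(raw_source) - header_names(abridged_view)
```

Here `missing` holds "received" and "return-path", both of which are on the default safe list quoted earlier, so training on the abridged view would leave those clues out.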


