[spambayes-dev] RE: [Spambayes] Re: Training empty messages problem

Sat Dec 18 06:15:22 CET 2004

> From: Tony Meyer
> Sent: Thursday, December 16, 2004 7:32 PM

<...>

> [Seth Goodman]
> >> Another general question on standards compliance is does
> >> Spambayes support the Resent-*: series of headers?  These are
> >> neither generated nor displayed by Outlook, since Microsoft
> >> apparently never considered RFC2822 relevant.
>
> [Kenny Pitt]
> > By default, SpamBayes ignores these headers.  There are
> > options that you can tweak in the config file if you want
> > them processed, though.  I believe the Tokenizer:safe_headers
> > option is where you would do this, but I've never used it
> > myself so I'm not 100% certain.
>
> If you add the headers to Tokenizer:safe_headers then a token will be
> generated that includes a count of how many times that header appears
> in the message (something like 'Header:Resent-From:1') if the count
> is > 0 (and even then, if Tokenizer:record_header_absence is True).
> It doesn't do any tokenization of the actual value of the header,
> though.

This probably has little value.  Most messages don't contain Resent-*:
headers today, so their mere presence tells the classifier very little.
Depending on which authentication schemes eventually come into use, their
presence may become a spam indicator, since they are a back door around some
of the currently proposed authentication schemes, such as Yahoo DomainKeys
(sad but true).

>
> However, the values for the four headers you suggested are all
> address lists, right (i.e. in the same form as a "To:" header)?

Exactly.

> In that case, they could be added to Tokenizer:address_headers
> which would mean that you'd get tokens like 'Resent-From:none',
> 'Resent-From:invalid', and 'Resent-From:addr:ta-meyer at ihug.co.nz'.

This works.
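
To make the three token shapes concrete, here is a rough sketch in Python.
It is not the SpamBayes tokenizer code; the function name
address_header_tokens is made up for illustration, and the rule used for
"invalid" (no "@" in the parsed address) is my own assumption.

```python
from email import message_from_string
from email.utils import getaddresses


def address_header_tokens(msg, header):
    """Yield illustrative tokens for one address-list header.

    Hypothetical helper, not part of SpamBayes.
    """
    values = msg.get_all(header, [])
    if not values:
        # Header is missing from the message entirely.
        yield "%s:none" % header
        return
    for _name, addr in getaddresses(values):
        if "@" in addr:
            yield "%s:addr:%s" % (header, addr.lower())
        else:
            # Assumption: anything without an "@" counts as invalid.
            yield "%s:invalid" % header


msg = message_from_string(
    "Resent-From: Tony <Tony@Example.org>\nSubject: test\n\nbody\n"
)
print(list(address_header_tokens(msg, "Resent-From")))
print(list(address_header_tokens(msg, "Resent-To")))
```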

<...>

> If not, then the best available at the moment is to read the source of
> Options.py, which ought to be reasonably readable.  The 1.0.1 version is
> here:
>
> http://cvs.sf.net/viewcvs.py/spambayes/spambayes/spambayes/Options.py?rev=1.107.4.1&view=markup

Thanks.  Perfectly readable.

[Seth Goodman]
> > One of my email accounts also has a special header for Brightmail
> > detected spam that would be helpful to tokenize.  This is not the
> > Brightmail tracker header itself, but one that my ISP adds.
>
> Yes, I get one of these too - "X-IHUG-iSpy".  It's probably some sort of
> Brightmail option.

[Seth Goodman]
> > This header is always the same text and is as follows:
> >
> > X-TDS-Spam: Potential Spam
> >
> > Is there any support for tokenizing special headers like this? Could I
> > manually add this to the list of safe headers to tokenize with that
> > option?
>
> Yes, however this will only be of use if the header doesn't appear in some
> mail, otherwise the token will be 'Header:X-TDS-Spam:1' for every message,
> and be no use at all.  My "X-IHUG-iSpy" header appears even if the message
> isn't thought to be spam (then the value is "Doesn't appear to be Spam"), so
> this would be the case for me.  If in your case the header isn't present for
> such mail, then this would work.

This would work for me.

>
> To generate tokens with the content, too, you'll need to either write
> specific code for it (like the Habeas(tm) headers, for example), or use
> Tokenizer:basic_header_tokenize (off by default).  If that option is
> enabled, then all headers generate tokens in the form "header:value", unless
> the header is listed in the Tokenizer:basic_header_skip option.

This would work for your header.

As long as I have to modify the source code to change the default list for
one or more options, I may as well do something that is useful to others.
My suspicion is that a number of users have ISPs that tag spam with
programs other than SpamAssassin, so some facility to do this through the
configuration file might be useful.  One idea is an additional Tokenizer
option called special_header_present.  The user would list the exact text
of the header line after the option.  The tokenizer would then generate a
token recording how many times that header line appears in the message.  For
example, I would have a single entry, since my ISP only puts the header in
if it thinks the message is spam:

[Tokenizer]
special_header_present:"X-TDS-Spam: Potential Spam"

Tony would have at least two entries, since the header is always there and
the content indicates if the ISP thinks it is spam:

[Tokenizer]
special_header_present:"X-IHUG-iSpy: Spam"
special_header_present:"X-IHUG-iSpy: Doesn't appear to be Spam"

Another possibility would be to have an option for special_header_content
and produce a single token for the string that appears in the header only if
the header is present.  This option would probably work better for inboxes
that collect from multiple accounts as there would not be any zero-count
tokens to skew the scores.
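
Roughly, special_header_content could behave like the Python sketch below.
Again this is only an illustration: the function name and the "name:value"
token format are my assumptions, and here the option is assumed to list
header names only rather than full header lines.

```python
from email import message_from_string


def special_header_content_tokens(msg, header_names):
    """Yield one "name:value" token per occurrence of each listed header.

    Hypothetical sketch of the proposed option, not SpamBayes code.
    A header that is absent produces no token at all.
    """
    for name in header_names:
        for value in msg.get_all(name, []):
            # Collapse runs of whitespace so folded headers give one token.
            yield "%s:%s" % (name, " ".join(value.split()))


names = ["X-IHUG-iSpy", "X-TDS-Spam"]
msg = message_from_string(
    "X-IHUG-iSpy: Doesn't appear to be Spam\nSubject: hello\n\nhi there\n"
)
print(list(special_header_content_tokens(msg, names)))
```

A message collected from an account whose ISP adds neither header would
yield nothing from this function.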

Does anyone else think this is useful or is it a waste of time?

--

Seth Goodman