[spambayes-dev] RE: [Spambayes] Re: Training empty messages problem

Fri Dec 17 02:31:50 CET 2004

Just so it's clear what's being discussed here - the changes we are
making/proposing only affect three areas:

 1. The tokenization of the message.
 2. The display of the message in the 'Show Clues' message.
 3. The message text if the export.py script is used.

Specifically, the changes don't affect any MUA at all (unless you import the
exported mail into another mailer, but then you ought to be using Outlook's
export facility).  These changes are intended to give more information to
the tokenizer and to the sort+group.py script - as long as the tokenizer can
parse them, the rest is (practically, if not aesthetically) unimportant.

[Seth Goodman]
> The second problem is the use of a semi-colon to separate addresses. 
> This is now supposed to be a comma, though the obsolete semi-colon 
> delimiter of RFC822 is explicitly supported.

Drat, I hadn't realised this.  That means our notate_to option is currently
wrong, which actually matters.  I'll fix that.  I agree with Kenny regarding
msgstore.py, though.

[Seth Goodman]
> This will also position you to more easily integrate
> with non-MS MUA's, hopefully open-source ones as they become
> popular enough.

This code is unlikely to be usable for anything outside Exchange, so it
shouldn't really matter.

[Seth Goodman]
>> Another general question on standards compliance is does
>> Spambayes support the Resent-*: series of headers?  These are
>> neither generated nor displayed by Outlook, since Microsoft
>> apparently never considered RFC2822 relevant.

[Kenny Pitt]
> By default, SpamBayes ignores these headers.  There are
> options that you can tweak in the config file if you want 
> them processed, though.  I believe the Tokenizer:safe_headers 
> option is where you would do this, but I've never used it 
> myself so I'm not 100% certain.

If you add the headers to Tokenizer:safe_headers then a token will be
generated that includes a count of how many times that header appears in the
message (something like 'Header:Resent-From:1') if the count is > 0 (and
even then, if Tokenizer:record_header_absence is True).  It doesn't do any
tokenization of the actual value of the header, though.
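
As a rough illustration of the counting behaviour described above, here is a
minimal Python sketch.  It is not the SpamBayes tokenizer code: the function
name header_count_tokens is made up, and the exact spelling of the token is
only "something like" the real one.  It also leaves out the
Tokenizer:record_header_absence handling.

```python
from collections import Counter
from email import message_from_string


def header_count_tokens(msg, safe_headers):
    """Yield one 'Header:<name>:<count>' token per listed header found in msg.

    Hypothetical helper for illustration; not part of SpamBayes.
    """
    # Map lower-cased names back to the spelling used in the option list.
    wanted = {name.lower(): name for name in safe_headers}
    counts = Counter(
        wanted[name.lower()] for name in msg.keys() if name.lower() in wanted
    )
    for name, count in sorted(counts.items()):
        yield "Header:%s:%d" % (name, count)


raw = (
    "From: someone@example.com\n"
    "Resent-From: forwarder@example.com\n"
    "Resent-From: another@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
tokens = list(
    header_count_tokens(message_from_string(raw), ["Resent-From", "Resent-To"])
)
print(tokens)
```

Only the count ends up in the token; the addresses themselves never do.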

However, the values of the four headers you suggested are all address lists,
right (i.e. in the same form as a "To:" header)?  In that case, they could
be added to Tokenizer:address_headers which would mean that you'd get tokens
like 'Resent-From:none', 'Resent-From:invalid', and
'Resent-From:addr:ta-meyer at ihug.co.nz'.

(You could add them to both if you want all the tokens).  Note that these
tokens won't already be in your database, so they'll be of no use until you
have trained on some messages that yield them.
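
To make the address-style tokens above a little more concrete, here is a
small Python sketch using the standard email package.  It only imitates the
three token shapes quoted above; the function name address_tokens and the
rules for when 'none' and 'invalid' are produced are my assumptions, not the
real tokenizer logic.

```python
from email import message_from_string
from email.utils import getaddresses


def address_tokens(msg, header):
    """Yield illustrative address tokens for one header of msg.

    Hypothetical helper for illustration; not part of SpamBayes.
    """
    values = msg.get_all(header, [])
    if not values:
        # Assumption: a missing header produces a single '<header>:none' token.
        yield "%s:none" % header
        return
    for _name, addr in getaddresses(values):
        if "@" in addr:
            yield "%s:addr:%s" % (header, addr.lower())
        else:
            # Assumption: anything without an '@' counts as invalid.
            yield "%s:invalid" % header


raw = (
    "From: someone@example.com\n"
    "Resent-From: First Person <first@example.com>, SECOND@Example.com\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
resent_from = list(address_tokens(msg, "Resent-From"))
resent_to = list(address_tokens(msg, "Resent-To"))
print(resent_from)
print(resent_to)
```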

[Seth Goodman]
> I looked at 
> file:///c:/Program%20Files/SpamBayes/docs/outlook/docs/configuration.html
> and it doesn't give any Tokenizer options at all, though they obviously
> exist.  The directions also state that there are no experimental options
> in this release.  Where else would I look to find a description of the
> supported configuration options?

The Outlook plug-in has two sets of configuration data (one's in the
'{profile name}.ini' file, the other is in the 'default_bayes_customize.ini'
file).  One is for things that only apply to Outlook (folder ids, timer
settings, etc), and the other is for options that are shared with the rest
of SpamBayes.  The configuration.html page you looked at is all about the
Outlook-only settings (including Outlook-only experimental options, like the
timer values once were) and doesn't mention the SpamBayes-specific options
at all.

Before 1.1/1.0.2 I'll update the documentation so that if Outlook users do
want to play around with these (the vast majority won't) then the
information is there.  I ought to have done this before asking users to give
the experimental options a go, but it slipped my mind (it's very simple for
users of the web interface).

For the moment, what you need to do is add the options to a file called
'default_bayes_customize.ini' in your Outlook plug-in's data directory (it
may or may not already exist).  The format is:

'''
[Section name]
option_name:option_value
'''
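
If you want to check that a file follows this layout before dropping it into
the data directory, Python's standard configparser module reads the same
"[section]" / "name:value" shape.  This is only a format check under my own
assumptions: the plug-in reads the file with its own options code, and
writing the value as "True" is a guess at how a boolean option would be
spelled.

```python
from configparser import ConfigParser

# Example contents in the format described above.  The section and option
# names are real SpamBayes names; the value spelling is an assumption.
example = """\
[Tokenizer]
basic_header_tokenize:True
"""

parser = ConfigParser()
parser.read_string(example)

for section in parser.sections():
    for name, value in parser.items(section):
        print("%s / %s = %r" % (section, name, value))
```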

The log will tell you if you set an option with an invalid value.  To figure
out what's available, if you have Python installed, you can use the
instructions in FAQ 4.12:

<http://spambayes.org/faq.html#now-i-know-what-the-format-looks-like-but-what-options-do-i-need-to-set>

If not, then the best available at the moment is to read the source of
Options.py, which ought to be reasonably readable.  The 1.0.1 version is
here:

<http://cvs.sf.net/viewcvs.py/spambayes/spambayes/spambayes/Options.py?rev=1.107.4.1&view=markup>

[Seth Goodman]
> One of my email accounts also has a special header for Brightmail 
> detected spam that would be helpful to tokenize.  This is not the 
> Brightmail tracker header itself, but one that my ISP adds.

Yes, I get one of these too - "X-IHUG-iSpy".  It's probably some sort of
Brightmail option.

[Seth Goodman]
> This header is always the same text and is as follows:
>
> X-TDS-Spam: Potential Spam
> 
> Is there any support for tokenizing special headers like this? Could I 
> manually add this to the list of safe headers to tokenize with that 
> option?

Yes; however, this will only be of use if the header is absent from some
mail.  Otherwise the token will be 'Header:X-TDS-Spam:1' for every message,
and be no use at all.  My "X-IHUG-iSpy" header appears even if the message
isn't thought to be spam (then the value is "Doesn't appear to be Spam"), so
this would be the case for me.  If in your case the header isn't present for
such mail, then this would work.

To generate tokens with the content, too, you'll need to either write
specific code for it (like the Habeas(tm) headers, for example), or use
Tokenizer:basic_header_tokenize (off by default).  If that option is
enabled, then all headers generate tokens in the form "header:value", unless
the header is listed in the Tokenizer:basic_header_skip option.
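
For a feel of what that option produces, here is a minimal Python sketch of
the "header:value" idea.  It is not the SpamBayes implementation: the
function name basic_header_tokens is made up, the skip list is treated as
plain case-insensitive header names, and the whole header value is kept as a
single token; all three are assumptions on my part.

```python
from email import message_from_string


def basic_header_tokens(msg, skip_headers):
    """Yield one '<header>:<value>' token per header not in skip_headers.

    Hypothetical helper for illustration; not part of SpamBayes.
    """
    skip = {name.lower() for name in skip_headers}
    for name, value in msg.items():
        if name.lower() in skip:
            continue
        yield "%s:%s" % (name, value.strip())


raw = (
    "Received: from mail.example.com\n"
    "X-TDS-Spam: Potential Spam\n"
    "\n"
    "body\n"
)
basic_tokens = list(basic_header_tokens(message_from_string(raw), ["received"]))
print(basic_tokens)
```

Unlike the count-only token, this one carries the header's text, so it can
still tell messages apart when the header is present on all of them but its
value varies.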

=Tony.Meyer