[spambayes-dev] RE: [Spambayes] Re: Training empty messages problem

Tony Meyer tameyer at ihug.co.nz
Tue Dec 14 08:17:30 CET 2004


> I get the same message-id behavior from our Exchange server.  

Ah - so do I, I realise now.  I was looking just in the headers that
SpamBayes shows in "Show Clues", but I see I get the invalid token even
without one showing there.  Looking at the tokenizing code, if there is no
message-id header, then the 'invalid' token is generated.  (I suppose the
thought is that not having the header is not a valid case).
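
To make that concrete, here is a minimal sketch of the behaviour described
above.  It is not the actual SpamBayes tokenizer: the regex and the function
name are assumptions made for illustration.  It only shows how a missing
header ends up in the same branch as a malformed one.

```python
import re
from email import message_from_string

# Hypothetical pattern (an assumption, not the SpamBayes one):
# captures the host part of "<local@host>".
MESSAGE_ID_RE = re.compile(r"\s*<[^@]+@([^>]+)>\s*")


def message_id_tokens(msg):
    """Yield one token describing the Message-ID header of msg."""
    # A missing header is read as the empty string, which cannot match.
    msgid = msg.get("message-id", "")
    m = MESSAGE_ID_RE.match(msgid)
    if m:
        yield "message-id:@%s" % m.group(1)
    else:
        # Missing and malformed headers both land here.
        yield "message-id:invalid"


# A message with no Message-ID header, as reconstructed from Exchange.
exchange_like = message_from_string("Subject: hello\n\nbody\n")
# A message carrying an ordinary Internet-style Message-ID.
internet_like = message_from_string(
    "Message-ID: <abc123@example.com>\nSubject: hello\n\nbody\n"
)
```

Calling list(message_id_tokens(exchange_like)) on the header-less message
gives just the 'message-id:invalid' token, which matches what shows up in the
clues for Exchange mail.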

> Notice that there is no message id header of any sort, and that
> the From and To fields do not use Internet standard address format.
> The following tokens were included among the clues, and are typical for
> most if not all of my Exchange mail:
> 
> """
> token                               spamprob         #ham  #spam
> 'message-id:invalid'                0.214766           19      9
> 'x-mailer:none'                     0.622068           88    258
> 'from:no real name:2**0'            0.642539           29     93
> """
> 
> Maybe there's a property in the Outlook message object 
> somewhere that we need to retrieve and add to the headers when we
> reconstruct the message?

Maybe we ought to try to generate headers for all those in the safe_headers
option (or, alternatively, change the default value of safe_headers for
Outlook users).  The headers that we could probably generate include "date",
"from" (we could be smarter about how it is presented), "importance",
"in-reply-to" (?), "message-id", "organization" (?), "received" (maybe too
much effort), "reply-to", "to" (smarter), and "user-agent".  We could also
generate "x-mailer", which is tokenized separately.
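
A rough sketch of what that generation step could look like follows.  The
property keys (sender_name, sender_email, delivery_time, and so on) are
hypothetical stand-ins, not real MAPI property names, and the function is not
the existing Outlook addin code.  The point is only that a header is emitted
when the data for it exists and silently skipped otherwise.

```python
from email.utils import formataddr, formatdate

# Each builder turns a dict of message properties into one header value.
# All property keys below are hypothetical, chosen for illustration only.
HEADER_BUILDERS = {
    "from": lambda p: formataddr((p["sender_name"], p["sender_email"])),
    "to": lambda p: ", ".join(formataddr(pair) for pair in p["recipients"]),
    "date": lambda p: formatdate(p["delivery_time"], usegmt=True),
    "message-id": lambda p: p["internet_message_id"],
    "importance": lambda p: p["importance"],
}


def synthesise_headers(props, wanted):
    """Return (name, value) pairs for the wanted headers we can build."""
    headers = []
    for name in wanted:
        build = HEADER_BUILDERS.get(name)
        if build is None:
            continue  # no builder for this header (e.g. "received")
        try:
            headers.append((name, build(props)))
        except KeyError:
            # The property is missing on this message: leave the header
            # out rather than inventing a value.
            continue
    return headers


props = {
    "sender_name": "Alice Example",
    "sender_email": "alice@example.com",
    "delivery_time": 0,  # seconds since the epoch
}
result = synthesise_headers(props, ["from", "date", "message-id", "received"])
```

Leaving out a header whose property is missing keeps the existing
"no such header" tokens for those messages, instead of replacing them with
made-up data.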

None of this is hard - it's just a case of running
Outlook2000/sandbox/dumpprops.py on one of these messages, looking up the
appropriate property names, and then modifying the function that
reconstructs the headers to get & format the appropriate data.  I guess (but
do not know) that getting a few extra properties as well as the ones we
already get would not significantly affect the time required.

However, there is the question of whether this will help or hinder.  At the
moment, we get a whole bunch of "I'm an Exchange message" tokens, which I
suspect for most people are significant ham clues.  If we replace those with
more data, maybe it'll be harder to nail Exchange messages (I would guess
not, but stupid beats smart, etc).

We could add an (experimental?) Outlook option
"synthesised_exchange_headers", which lists the headers (like those above)
to try to synthesise (the current situation being "to,from,subject").  That
way at least users could relatively easily change the situation (e.g. revert
to 1.0.x behaviour).  (Retraining would probably be necessary to have much
effect, though.)
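
As a sketch of how such an option value might be read, assuming it is stored
as a comma-separated string like the "to,from,subject" above (the function
name and the normalisation rules are assumptions, not existing SpamBayes
code):

```python
# The value describing the current situation, as given above.
DEFAULT_SYNTHESISED = "to,from,subject"


def parse_header_option(value):
    """Split a comma-separated option value into lowercase header names."""
    names = []
    for raw in value.split(","):
        name = raw.strip().lower()
        # Skip empty entries and duplicates, keeping first-seen order.
        if name and name not in names:
            names.append(name)
    return names


current = parse_header_option(DEFAULT_SYNTHESISED)
```

Setting the option back to "to,from,subject" would then restore the current
list of synthesised headers.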

I'll try and find time to whip up something like this and run some test
scripts with it (although the ratio of Exchange mail will have a big
influence on results, I imagine) and see what happens.  Probably not until
the end of the week, or the start of next one, though.

At least, since it's Outlook, if we make the situation worse, Tim will
probably notice and yell at us <wink>.

=Tony.Meyer


