[spambayes-dev] RE: [Spambayes] Re: Training empty messages problem
kenny.pitt at gmail.com
Thu Dec 16 23:21:15 CET 2004
Seth Goodman wrote:
>> From: Kenny Pitt
>> Sent: Thursday, December 16, 2004 9:56 AM
>> The correct format, I believe, would be:
>> To: kennypitt at hotpop.com; "Kenny Pitt" <KennyPitt at invalid>
>> Should be a simple matter of splitting the addresses on the ";"
>> character. I'm going to go take a shot at this and see what I get.
> This is acceptable, and typical for Outlook, but it does involve some
> legacy constructs which have been deprecated in RFC2822. RFC2822
> updates and replaces RFC822 for all practical purposes and is a
> better reference to use. It does list which "obsolete" address
> formats must be accepted. In particular, the use of the first
> address without angle brackets is deprecated, though recognized as a
> legacy format that must be accepted. Current practice is to include
> all addresses in angle brackets and unless that causes problems in
> Outlook, that would be preferable. The second problem is the use of
> a semi-colon to separate addresses. This is now supposed to be a
> comma, though the obsolete semi-colon delimiter of RFC822 is
> explicitly supported. My copy of Outlook2000 contains a check box to
> accept commas as address delimiters, which is the default setting,
> but it still produces semicolons for display. I think it would be
> prudent to accept either delimiter, in case MS ever gets a gram of
> clue and drops the deprecated format. This will also position you to
> more easily integrate with non-MS MUA's, hopefully open-source ones
> as they become popular enough. Microsoft never did give a rat's
> posterior about IETF standards and often uses them as a marketing
> tool to "differentiate" their products (translation: intentionally
> create interoperability problems).
That's all well and good, and I definitely appreciate the additional info,
but we're not really concerned with interoperability here, so all this may be
overkill. The sole purpose here is to provide SpamBayes with something that
it can recognize and produce reasonable tokens from when asked to process an
Exchange message received from another local Exchange user.
As long as we know that Microsoft always uses ';' as the delimiter in the
Exchange Display Name field then I would prefer to stick to that delimiter
to keep the code concise. The reason there are no brackets around the
stand-alone Internet address is because that was how Outlook handed it to
me. The Python e-mail package already contains all the necessary parsing
logic to figure out what format the address is in. There didn't seem to be
much point in doing extra parsing of the address to figure out if I needed
to convert it to a different format that would then produce exactly the same
set of SpamBayes tokens.
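
For illustration only, here is a minimal sketch of the split-and-parse
approach described above. It is not the actual addin code: the function name
split_exchange_addresses and the example addresses are hypothetical, and it
assumes no display name ever contains the ';' delimiter. The only library
call used is email.utils.parseaddr from the standard Python e-mail package.

```python
from email.utils import parseaddr

def split_exchange_addresses(display_field, delimiter=";"):
    """Split an Outlook-style address field and parse each piece.

    Hypothetical helper: a naive split, so a quoted display name that
    contains the delimiter would be broken apart.
    """
    parsed = []
    for piece in display_field.split(delimiter):
        piece = piece.strip()
        if piece:  # skip empty pieces, e.g. from a trailing delimiter
            parsed.append(parseaddr(piece))  # (real name, address)
    return parsed

# Made-up addresses in the two shapes discussed in this thread.
field = 'user@example.com; "Example User" <ExampleUser@invalid>'
pairs = split_exchange_addresses(field)

# The bare and the bracketed form parse to the same address, which is
# why converting one into the other adds nothing for tokenizing.
bare = parseaddr("user@example.com")
bracketed = parseaddr("<user@example.com>")
```

Accepting ',' as well would mean passing a different delimiter, but a plain
split on commas would break display names such as "Pitt, Kenny", so that
case would need real address-list parsing rather than a split.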
> Another general question on standards compliance is does Spambayes
> support the Resent-*: series of headers? These are neither generated
> nor displayed by Outlook, since Microsoft apparently never considered
> RFC2822 relevant. However, many other MUA's use the remailing syntax
> of that standard, which uses those headers. Though they are defined
> as trace headers and in that sense are optional, they are required in
> order to use the remailing semantics of RFC2822 section 3.6.6. An
> example of this is Pine's bounce function. The fact that MS
> completely ignores those headers in their MUA's has created a huge
> problem for those of us who are involved in message authentication
> standards efforts. When used, those headers do contain important
> information, and as authentication becomes more common, they will
> become more important. My suggestion is that, of that whole series
> of headers, the ones that would be of interest to Spambayes are:
There are some differences between what non-Outlook versions of SpamBayes
such as sb_server, sb_filter, and sb_imapfilter will see and what the
Outlook addin will see because of the way Outlook destroys the original
structure of the message. However, one thing that *is* preserved is the
original headers of a message received via SMTP, so these headers should be
included if they were part of the original message.
By default, SpamBayes ignores these headers. There are options that you can
tweak in the config file if you want them processed, though. I believe the
Tokenizer:safe_headers option is where you would do this, but I've never
used it myself so I'm not 100% certain.
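
As a rough sketch of what "preserved original headers" means in practice,
the following uses the standard Python e-mail package to pull any Resent-*
headers out of a raw message. The sample message and the helper name
resent_headers are made up for the example; this is not how SpamBayes
itself tokenizes headers.

```python
from email import message_from_string

# Made-up message with two Resent-* headers ahead of the original ones.
raw = (
    "Resent-From: forwarder@example.org\n"
    "Resent-Date: Thu, 16 Dec 2004 10:00:00 -0500\n"
    "From: sender@example.com\n"
    "To: recipient@example.net\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

def resent_headers(raw_message):
    """Return (name, value) pairs for every Resent-* header, in order."""
    msg = message_from_string(raw_message)
    return [(name, value) for name, value in msg.items()
            if name.lower().startswith("resent-")]

found = resent_headers(raw)
```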
> Below is the relevant text from RFC2822. Some tokens are only
> defined in other sections and there are two that are worth describing
> here. "Phrase" is a quoted string, an atom or an obsolete format
> consisting of a combination of words including "." and CFWS. CFWS is
> "commented folding white space" that encompasses folding white space
> and comments, where comments are parenthesis-delimited strings. This
> is relevant to the way you described Outlook presenting some
> addresses and is the most serious difference from the standards.
> Even in RFC822, comments were permitted but expressly ignored in
> address strings, so Microsoft's practice is completely broken.
> RFC2822 specifically says that comments SHOULD NOT be included in
> address fields, as legacy implementations sometimes interpret the
> comments. Apparently, we are now a legacy application because MS has
> forced us to interpret the content of comments in order to get the
> correct address-list from their broken MUA. Buggers.
I don't remember mentioning anything about comments presented by Outlook in
the address string. The comments came from Tony's first pass at simulating
the address headers for an Exchange e-mail address, which is typically just
a real name without an RFC 822 or 2822 compatible e-mail address. That has
now been changed to use the real name along with a simulated RFC 822 (and
hopefully 2822) compliant local address using the standard <> brackets.