[spambayes-dev] RE: [Spambayes] Re: Training empty messages problem

Seth Goodman sethg at GoodmanAssociates.com
Thu Dec 16 22:05:44 CET 2004

> From: Kenny Pitt
> Sent: Thursday, December 16, 2004 9:56 AM


> The correct format, I believe, would be:
> To: kennypitt at hotpop.com; "Kenny Pitt" <KennyPitt at invalid>
> Should be a simple matter of splitting the addresses on the ";" character.
> I'm going to go take a shot at this and see what I get.

This is acceptable, and typical for Outlook, but it does involve some legacy
constructs which have been deprecated in RFC2822.  RFC2822 updates and
replaces RFC822 for all practical purposes and is a better reference to use.
It does list which "obsolete" address formats must be accepted.  In
particular, the use of the first address without angle brackets is
deprecated, though recognized as a legacy format that must be accepted.
Current practice is to enclose all addresses in angle brackets, and unless
that causes problems in Outlook, that would be preferable.  The second
problem is the use of a semicolon to separate addresses.  This is now
supposed to be a comma, though the obsolete semicolon delimiter of RFC822
is explicitly supported.  My copy of Outlook 2000 has a check box to
accept commas as address delimiters, which is the default setting, but it
still produces semicolons for display.  I think it would be prudent to accept
either delimiter, in case MS ever gets a gram of clue and drops the
deprecated format.  This will also position you to more easily integrate
with non-MS MUA's, hopefully open-source ones as they become popular enough.
Microsoft never did give a rat's posterior about IETF standards and often
uses them as a marketing tool to "differentiate" their products
(translation: intentionally create interoperability problems).
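
Accepting either delimiter could be sketched roughly as follows.  This is
only an illustration, not Spambayes code: the function names
split_addresses and parse_address_header and the example.com addresses are
made up for the example.  It splits on "," or ";" only when the character
sits outside quoted strings, comments and angle brackets, then hands each
piece to the standard library's email.utils.parseaddr.  One known
limitation: it does not understand the RFC2822 group syntax, where ";"
terminates a group rather than separating addresses.

```python
from email.utils import parseaddr


def split_addresses(header_value):
    """Split on "," or ";" found outside quotes, comments and <...>."""
    parts, current = [], []
    in_quotes = False      # inside a "quoted string"
    in_angle = False       # inside <angle brackets>
    comment_depth = 0      # nesting level of (comments)
    escaped = False        # previous character was a backslash
    for ch in header_value:
        if escaped:
            current.append(ch)
            escaped = False
            continue
        if ch == "\\" and (in_quotes or comment_depth):
            current.append(ch)
            escaped = True
            continue
        if in_quotes:
            if ch == '"':
                in_quotes = False
        elif comment_depth:
            if ch == "(":
                comment_depth += 1
            elif ch == ")":
                comment_depth -= 1
        elif ch == '"':
            in_quotes = True
        elif ch == "(":
            comment_depth = 1
        elif ch == "<":
            in_angle = True
        elif ch == ">":
            in_angle = False
        elif ch in ",;" and not in_angle:
            # A delimiter at top level ends the current address.
            parts.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def parse_address_header(header_value):
    """Return (display name, address) pairs for every address found."""
    return [parseaddr(part) for part in split_addresses(header_value)]


outlook_style = 'kennypitt@example.com; "Kenny Pitt" <kenny@example.com>'
standard_style = '<kennypitt@example.com>, "Pitt; Kenny" <kenny@example.com>'
print(parse_address_header(outlook_style))
print(parse_address_header(standard_style))
```

The second sample shows why a plain split on ";" is not quite enough: a
semicolon inside a quoted display name has to stay with its address.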

Another general question on standards compliance: does Spambayes support
the Resent-*: series of headers?  These are neither generated nor displayed
by Outlook, since Microsoft apparently never considered RFC2822 relevant.
However, many other MUA's use the remailing syntax of that standard, which
uses those headers.  Though they are defined as trace headers and in that
sense are optional, they are required in order to use the remailing
semantics of RFC2822 section 3.6.6.  An example of this is Pine's bounce
function.  The fact that MS completely ignores those headers in their MUA's
has created a huge problem for those of us who are involved in message
authentication standards efforts.  When used, those headers do contain
important information, and as authentication becomes more common, they will
become more important.  My suggestion is that, of that whole series of
headers, the ones that would be of interest to Spambayes are:


Below is the relevant text from RFC2822.  Some tokens are only defined in
other sections and there are two that are worth describing here.  "Phrase"
is a quoted string, an atom or an obsolete format consisting of a
combination of words including "." and CFWS.  CFWS is "commented folding
white space" that encompasses folding white space and comments, where
comments are parenthesis-delimited strings.  This is relevant to the way you
described Outlook presenting some addresses and is the most serious
difference from the standards.

Even in RFC822, comments were permitted but expressly ignored in address
strings, so Microsoft's practice is completely broken.  RFC2822 specifically
says that comments SHOULD NOT be included in address fields, as legacy
implementations sometimes interpret the comments.  Apparently, we are now a
legacy application because MS has forced us to interpret the content of
comments in order to get the correct address-list from their broken MUA.
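
To make the comment issue concrete, here is a small sketch using Python's
standard email.utils.parseaddr (the addresses are invented for the
example).  As far as I know, that parser takes the legacy route the RFC
warns about: when an address has no display name, it uses the text of a
trailing comment as the name, so both forms below should come out the
same.

```python
from email.utils import parseaddr

# Legacy form: bare addr-spec with the name in a trailing comment.
legacy = "kenny@example.com (Kenny Pitt)"
# Form recommended by RFC2822: display name plus angle-addr.
preferred = '"Kenny Pitt" <kenny@example.com>'

legacy_result = parseaddr(legacy)
preferred_result = parseaddr(preferred)

print(legacy_result)
print(preferred_result)
print(legacy_result == preferred_result)
```

A parser that followed the letter of the standard and discarded comments
would return an empty display name for the first form.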

3.4. Address Specification

   Addresses occur in several message header fields to indicate senders
   and recipients of messages.  An address may either be an individual
   mailbox, or a group of mailboxes.

address         =       mailbox / group

mailbox         =       name-addr / addr-spec

name-addr       =       [display-name] angle-addr

angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr

group           =       display-name ":" [mailbox-list / CFWS] ";"
                        [CFWS]

display-name    =       phrase

mailbox-list    =       (mailbox *("," mailbox)) / obs-mbox-list

address-list    =       (address *("," address)) / obs-addr-list

   A mailbox receives mail.  It is a conceptual entity which does not
   necessarily pertain to file storage.  For example, some sites may
   choose to print mail on a printer and deliver the output to the
   addressee's desk.  Normally, a mailbox is comprised of two parts: (1)
   an optional display name that indicates the name of the recipient
   (which could be a person or a system) that could be displayed to the
   user of a mail application, and (2) an addr-spec address enclosed in
   angle brackets ("<" and ">").  There is also an alternate simple form
   of a mailbox where the addr-spec address appears alone, without the
   recipient's name or the angle brackets.  The Internet addr-spec
   address is described in section 3.4.1.

   Note: Some legacy implementations used the simple form where the
   addr-spec appears without the angle brackets, but included the name
   of the recipient in parentheses as a comment following the addr-spec.
   Since the meaning of the information in a comment is unspecified,
   implementations SHOULD use the full name-addr form of the mailbox,
   instead of the legacy form, to specify the display name associated
   with a mailbox.  Also, because some legacy implementations interpret
   the comment, comments generally SHOULD NOT be used in address fields
   to avoid confusing such implementations.

   When it is desirable to treat several mailboxes as a single unit
   (i.e., in a distribution list), the group construct can be used.  The
   group construct allows the sender to indicate a named group of
   recipients. This is done by giving a display name for the group,
   followed by a colon, followed by a comma separated list of any number
   of mailboxes (including zero and one), and ending with a semicolon.
   Because the list of mailboxes can be empty, using the group construct
   is also a simple way to communicate to recipients that the message
   was sent to one or more named sets of recipients, without actually
   providing the individual mailbox address for each of those


Seth Goodman