[Mailman-Users] Garbled headers - was: gmail marks mailman confirmation mail as spam...

Mon Jun 15 19:17:09 CEST 2009

I am trying to move this thread to email-sig at python.org since the
underlying issue is in the email package. Further, since Mailman 2.1.12
and later no longer install a Mailman-specific version of the email
package, the issue really has to be addressed in the email package itself.

Stephen J. Turnbull wrote:
> Mark Sapiro writes:
> 
>  > I think there is a minor bug in decode_header() in that it won't
>  > recognize a RFC 2047 encoded word in a comment if the encoded word is
>  > not separated by whitespace from the ")" that terminates the comment.
>  > However, this is the only place where an encoded word need not be
>  > followed by whitespace or the end of the header.
> 
> Indeed that's a bug.  I gather that you're saying that this bug is not
> the cause of the OP's problem, though?

Correct.

>  > The Subject: header above is non-compliant in two respects. It is too
>  > long.  [...]  However, decode_header will accept it anyway and do
>  > the right thing.
> 
> As it should, according to the Postel Principle.  Anyway, IIRC the
> length limit is a SHOULD NOT, not a MUST NOT, right?

The RFC (28|53)22 limits on line length, excluding the CRLF, are MUST BE
<= 998 and SHOULD BE <= 78 characters. RFC 2047 seems to want to impose
stricter limits on encoded words, but unfortunately does not use the
defined terms MUST and SHOULD. Section 2 says in part:

   An 'encoded-word' may not be more than 75 characters long, including
   'charset', 'encoding', 'encoded-text', and delimiters.  If it is
   desirable to encode more text than will fit in an 'encoded-word' of
   75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
   be used.

   While there is no limit to the length of a multiple-line header
   field, each line of a header field that contains one or more
   'encoded-word's is limited to 76 characters.

so it is not clear whether these are 'recommendations' or
'requirements'. In any case, email.header.decode_header() is not
enforcing any limits so we are being generous in what we accept in this
respect.
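
For illustration only, here is a rough sketch of what checking those
limits could look like. Nothing like this exists in the email package as
far as I know; the function name check_header, the constants and the
simplified encoded-word pattern are all my own assumptions, and the
RFC 2047 limits are treated as hard limits purely for the sake of the
example.

```python
import re

MAX_LINE = 998          # RFC 2822 section 2.1.1, MUST
RECOMMENDED_LINE = 78   # RFC 2822 section 2.1.1, SHOULD
MAX_ENCODED_WORD = 75   # RFC 2047 section 2, per encoded-word
MAX_ENCODED_LINE = 76   # RFC 2047 section 2, per line holding an encoded-word

# Simplified encoded-word matcher for this sketch only (an assumption);
# it is not the pattern used by email/header.py.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[qQbB]\?[^?\s]*\?=")


def check_header(folded):
    """Return a list of complaints about one (possibly folded) header."""
    problems = []
    for n, line in enumerate(folded.splitlines(), 1):
        words = ENCODED_WORD.findall(line)
        if len(line) > MAX_LINE:
            problems.append("line %d: over %d characters (MUST)" % (n, MAX_LINE))
        elif len(line) > RECOMMENDED_LINE:
            problems.append(
                "line %d: over %d characters (SHOULD)" % (n, RECOMMENDED_LINE))
        if words and len(line) > MAX_ENCODED_LINE:
            problems.append(
                "line %d: contains an encoded-word and is over %d characters"
                % (n, MAX_ENCODED_LINE))
        for word in words:
            if len(word) > MAX_ENCODED_WORD:
                problems.append(
                    "line %d: encoded-word over %d characters"
                    % (n, MAX_ENCODED_WORD))
    return problems


# A single over-long encoded word trips three of the checks at once.
long_header = "Subject: =?utf-8?q?" + "a" * 80 + "?="
short_header = "Subject: =?utf-8?q?hello?="
for problem in check_header(long_header):
    print(problem)
```

A decoder that wanted to be strict could refuse a header for which this
returns anything; a generous one would simply ignore the result.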

>  > real problem is item (1) in section 5 of the RFC says in part:
>  > 
>  >     Ordinary ASCII text and 'encoded-word's may appear together in the
>  >     same header field.  However, an 'encoded-word' that appears in a
>  >     header field defined as '*text' MUST be separated from any adjacent
>  >     'encoded-word' or 'text' by 'linear-white-space'.
>  > 
>  > The header above does not comply with this.
> 
> Agreed, but I think that by default[1] email should try to parse this
> header as the user intended it.  It's not like encoded-words are that
> easy to confuse with intended text; it's unlikely that changing
> 'linear-white-space' above to 'linear-white-space or specials' would
> harm anyone.

I fully agree. There is a regexp (ecre) in email/header.py that ends
with the lookahead assertion "(?=[ \t]|$)". Even in "strict mode", the
lookahead needs to accept ")" as well as space and tab, and I think that
by default it should simply be removed.
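
To make the three variants concrete, here is a small sketch. The base
pattern is my own simplified paraphrase of an encoded-word matcher, not a
copy of ecre, and the names strict, with_paren, relaxed and count are
made up for this example; only the trailing lookahead is taken from the
description above.

```python
import re

# Simplified encoded-word pattern for illustration (an assumption, not
# the actual ecre from email/header.py).
ENCODED_WORD = (
    r"=\?(?P<charset>[^?]*?)\?(?P<encoding>[qb])\?(?P<encoded>.*?)\?="
)
FLAGS = re.IGNORECASE | re.MULTILINE

# Current behavior as described: whitespace or end of line must follow.
strict = re.compile(ENCODED_WORD + r"(?=[ \t]|$)", FLAGS)
# Proposed "strict mode": also accept the ")" that closes a comment.
with_paren = re.compile(ENCODED_WORD + r"(?=[ \t)]|$)", FLAGS)
# Proposed default: no lookahead at all.
relaxed = re.compile(ENCODED_WORD, FLAGS)


def count(pattern, text):
    """Number of encoded words the pattern recognizes in text."""
    return len(pattern.findall(text))


# Encoded word directly followed by the ")" that terminates a comment.
comment = "(=?iso-8859-1?q?Andr=E9?=)"
# Encoded word with ordinary text on both sides and no whitespace.
adjacent = "Re:=?iso-8859-1?q?Andr=E9?=text"

for name, pattern in (("strict", strict),
                      ("with_paren", with_paren),
                      ("relaxed", relaxed)):
    print(name, count(pattern, comment), count(pattern, adjacent))
```

With these sample strings, the first pattern finds no encoded word in
either case, the second finds the one in the comment only, and the third
finds both.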

>  > This is a problem with the MUA (mail client) that encoded the Subject:
>  > header in the first place.
> 
> Agreed, but I think following the Postel Principle here is likely to
> do less harm than adhering strictly to the RFC.

I agree here too, and note that some MUAs (all three I tried, including
mutt and Thunderbird) decode the original header as intended.

> That said, I'm not in a position to contribute code, and this is a
> pretty invasive change, so the user is unlikely to see a version of
> Mailman that handles this any time soon.  They are likely to have more
> luck switching clients.
> 
> Footnotes: 
> [1]  Ie, there should be an option to be strict.
> 
> 

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan