[Email-SIG] email.header.decode_header eats my spaces

Thu Mar 29 04:58:43 CEST 2007

Tokio Kikuchi writes:

 > Barry Warsaw wrote:

 > > To me, this is a flaw in the rfc because there's no way to /avoid/ 
 > > whitespace between unencoded and encoded parts!
 > 
 > Well, it looks to me that RFC2047 prohibits [deleting whitespace] at
 > least in header text.

That's my understanding as well, for the reasons Tokio gave.

If you want "unicode(h)" ==> "helloworld", you need to encode the
whole string.  (Giving 'hello' the charset 'utf-8' would be a hackish
way of doing this.)
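
To make the two cases concrete, here is a small sketch against the
email.header module of a current Python 3.  That version is an assumption
on my part; the 2007 module discussed in this thread may behave
differently.  The sample strings are made up for illustration.

```python
from email.header import Header, decode_header

# Two adjacent encoded-words: under RFC 2047 the whitespace between them
# is only a separator, so the decoder drops it and joins the words.
adjacent = decode_header('=?utf-8?q?hello?= =?utf-8?q?world?=')

# Unencoded text followed by an encoded-word: the separating whitespace
# is part of the unencoded text and stays with it.
mixed = decode_header('hello =?utf-8?q?world?=')

# Encoding the whole string as a single chunk is the way to get
# "helloworld" with no space in it.
whole = decode_header(Header('helloworld', 'utf-8').encode())

print(adjacent)
print(mixed)
print(whole)
```

Only the first and last forms decode to a single "helloworld" chunk.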

 > The problem in email.header module is it cannot distinguish between
 > the structured and unstructured (text only) headers.

Yikes!  I didn't think of it that way before, but now that you mention
it, my spine is freezing.

 > The Header class may have a member function like 'add_comment',
 > IMHO.

IMHO, the Header class should be abstract, and there should be
subclasses that handle dates, lists of addresses, lists of
message-ids, etc. as appropriate to header fields structured in each
particular way.  Only those object handlers appropriate to a given
field would be exposed.  StarTextHeader would be the unstructured
derivative of the (implicitly structured) Header class.
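
A rough sketch of what I have in mind, in current Python 3.  Apart from
StarTextHeader, every class and method name here (AbstractHeader,
AddressListHeader, DateHeader, format_value, format) is hypothetical and
not part of the email package; only Header, formataddr, and
format_datetime are real library calls.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timezone
from email.header import Header
from email.utils import format_datetime, formataddr


class AbstractHeader(ABC):
    """Hypothetical base class: knows the field name, not its structure."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def format_value(self):
        """Return the field body as wire-format text."""

    def format(self):
        return '%s: %s' % (self.name, self.format_value())


class StarTextHeader(AbstractHeader):
    """Unstructured (*text) field: the whole body is free text."""

    def __init__(self, name, text):
        super().__init__(name)
        self.text = text

    def format_value(self):
        # Header() leaves ASCII text alone and falls back to UTF-8
        # encoded-words when the text is not ASCII.
        return Header(self.text, header_name=self.name).encode()


class AddressListHeader(AbstractHeader):
    """Structured field holding (display name, addr-spec) pairs."""

    def __init__(self, name, addresses):
        super().__init__(name)
        self.addresses = list(addresses)

    def format_value(self):
        # formataddr() encodes only the display name, never the addr-spec.
        return ', '.join(formataddr(pair) for pair in self.addresses)


class DateHeader(AbstractHeader):
    """Structured field holding a timezone-aware datetime."""

    def __init__(self, name, when):
        super().__init__(name)
        self.when = when

    def format_value(self):
        return format_datetime(self.when)


subject = StarTextHeader('Subject', 'hello world').format()
to = AddressListHeader('To', [('Alice', 'alice@example.org'),
                              ('Bob', 'bob@example.org')]).format()
date = DateHeader('Date', datetime(2007, 3, 29, 4, 58, 43,
                                   tzinfo=timezone.utc)).format()
print(subject)
print(to)
print(date)
```

Each subclass exposes only the constructor arguments that make sense for
its field, so a client never touches encoded-words directly.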

Barry again:

 > > I really have no way of knowing what the intention of the rfc is here,
 > > so perhaps we need a flag on the Header class (or in the .append()
 > > method) to specify which interpretation the user wants.

I really don't think that users should be allowed to "specify
interpretation".  RFC 2047 is a "transfer encoding".  Users should
never need to deal with that kind of thing, and it is dangerous to
allow them to do so.  Users (including software clients of the
package, of course) should simply hand objects and text to the Header
class to format according to 2822, 2047, and the definition of each
field's structure.

Nevertheless, given that RFC 2047 (and 2822, for that matter) is
explicitly intended to allow headers to be human-readably formatted
but still machine-parsable, the user should be allowed to express
*preferences* for the formatting.  For example, qp_preference_function
would be a function of the header contents such that if it returns
true, QP encoding should be used, otherwise BASE64.  But the decision
to use encoded words would not be a user choice.  There might be a
preserve_whitespace_literally preference, in which case the whole
header would have to be RFC 2047 encoded -- but in the case of
structured headers (e.g., address lists), you can't simply BASE64 the
whole thing, only the *text components!
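
Such a preference might look like the sketch below, again in current
Python 3.  The heuristic inside qp_preference_function and the helper
encode_text are hypothetical illustrations; the library pieces used are
email.charset.Charset and its header_encoding attribute, plus Header.

```python
from email import charset as email_charset
from email.header import Header


def qp_preference_function(text):
    """Hypothetical preference: favor QP when most characters are ASCII."""
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)
    return 2 * ascii_chars >= len(text)


def encode_text(text, prefer_qp=qp_preference_function):
    """Encode one *text component, consulting the preference if needed."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        pass
    else:
        # Pure ASCII needs no encoded-words, so the preference is never
        # consulted: whether to encode is not the user's choice.
        return Header(text).encode()
    # Use a private Charset instance so the global defaults stay untouched.
    cs = email_charset.Charset('utf-8')
    if prefer_qp(text):
        cs.header_encoding = email_charset.QP
    else:
        cs.header_encoding = email_charset.BASE64
    return Header(text, cs).encode()


print(encode_text('hello world'))
print(encode_text('caf\u00e9 au lait'))
print(encode_text('\u65e5\u672c\u8a9e'))
```

The user only influences which encoding is picked once encoding is
already required; for a structured field, something like encode_text
would be applied to each text component separately, never to the whole
field body.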

And the email package needs to be free to deal with structured headers
appropriately (for example, breaking very long addresses to try to
keep line-length to a reasonable level).  It may not be feasible to
respect user preferences in all cases.

Maybe there could be an escape to allow Sufficiently Smart Users to
format headers "by hand", but its use should be discouraged in favor
of a structured Header subclass that DTRTs.

 > email.header module is not a word processor.

Good slogan!