[Python-bugs-list] [ python-Bugs-594893 ] printing email object deletes whitespace

Tue, 10 Sep 2002 13:52:21 -0700

Bugs item #594893, was opened at 2002-08-14 00:59
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=594893&group_id=5470

Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Skip Montanaro (montanaro)
Assigned to: Barry A. Warsaw (bwarsaw)
Summary: printing email object deletes whitespace

Initial Comment:
In certain situations when printing email Message objects (I think), 
whitespace in headers disappears.  The attached zip file 
demonstrates this problem.  In email.orig, there is a line break 
followed by a TAB in the X-Vm-v5-Data header at the end of the 
first continuation line.  In email.new, which was generated by 
printing an email.Message object, the line break and TAB are gone, 
but no SPACE was inserted in their place.

This example is from a larger program which reads in a Unix 
mailbox like so:

    msgdict = {}
    i = 0
    for msg in mailbox.PortableUnixMailbox(f,
                                 email.Parser.Parser().parse):
        subj = msg["subject"]
        item = msgdict.get(subj) or []
        item.append((i, msg))
        msgdict[subj] = item
        i += 1

runs through msgdict and deletes a bunch of messages matching 
various criteria, then prints out those which remain retaining the 
relative order they had in the original mailbox:

    msglist = []
    for val in msgdict.values():
        msglist.extend(val)
    msglist.sort()
    for i,msg in msglist:
        print msg

email.orig was plucked from the input mailbox and email.new from 
the output mailbox.

----------------------------------------------------------------------

>Comment By: Barry A. Warsaw (bwarsaw)
Date: 2002-09-10 16:52

Message:
Logged In: YES 
user_id=12800

> Will that also solve the problem of space getting deleted?

I'm not sure, give it a try! :)  If not, then I think we'll
add maxheaderlen=-1 to mean do no wrapping or filling of
header values (which *should* take care of the problem).

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2002-09-10 16:45

Message:
Logged In: YES 
user_id=44345

> Why treat just the X-* header specially?

Because of all the possible headers they are the ones we know the least 
about format-wise.  From my selfish perspective, they are the ones I am 
having the most trouble with... ;-)

I'd be happy to experiment with the maxheaderlen argument.  I wasn't 
aware it existed.  Will that also solve the problem of space getting 
deleted?

----------------------------------------------------------------------

Comment By: Barry A. Warsaw (bwarsaw)
Date: 2002-09-10 16:32

Message:
Logged In: YES 
user_id=12800

Why treat just the X-* header specially?

BTW, the reason headers are wrapped in the first place is
that RFC 2822 specifies hard and soft limits to header
lengths.  I think the hard limit is 998 characters, but it
is recommended that no header be longer than 78 characters
without wrapping.

OTOH, a header like the X-VM-... header is for internal use
only, so it's probably never used outside of your own
applications.  Note that you can suppress all wrapping by
setting the maxheaderlen argument in the Generator's
constructor to some outrageously large value (try 2000). 
Maybe a negative value should indicate that no wrapping of
any headers be done?  (Maybe limited to just non-encoded
headers?)
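
A minimal sketch of the maxheaderlen workaround described above.  It is
written against the Python 3 email package, not the 2002 code under
discussion, so behaviour may differ from what was observed in this report.
The header name and value are made up for illustration.  In current Python
the Generator documentation also states that maxheaderlen=0 disables
rewrapping.

```python
from email.generator import Generator
from email.header import Header
from email.message import Message
from io import StringIO

# Made-up header value: long enough that it must be folded at 78 columns.
value = " ".join('"token-%02d"' % n for n in range(20))

msg = Message()
msg["X-Demo-Data"] = Header(value, header_name="X-Demo-Data")
msg.set_payload("body\n")


def flatten(message, maxheaderlen):
    """Serialize message using the given header wrapping limit."""
    out = StringIO()
    Generator(out, mangle_from_=False, maxheaderlen=maxheaderlen).flatten(message)
    return out.getvalue()


wrapped = flatten(msg, 78)      # header is folded across several lines
unwrapped = flatten(msg, 2000)  # the "outrageously large value" workaround

print(wrapped)
print(unwrapped)
```

With the large limit the header should come out on a single line, which
sidesteps the folding code entirely rather than fixing it.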

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2002-09-10 16:24

Message:
Logged In: YES 
user_id=44345

A slightly less wild idea - why not just suppress all folding/reformatting for 
X-* headers and instead always emit the raw header value that was in the 
original message?  That should solve the problem in the short term and 
allow you to come up with a suitable API for the longer term.

----------------------------------------------------------------------

Comment By: Barry A. Warsaw (bwarsaw)
Date: 2002-09-10 16:04

Message:
Logged In: YES 
user_id=12800

It doesn't, it just suggests that when wrapping a line:

   [...] folding SHOULD be limited to placing the CRLF at
   higher-level syntactic breaks.  For instance, if a field body
   is defined as comma-separated values, it is recommended that
   folding occur after the comma separating the structured items
   in preference to other places where the field could be folded,
   even if it is allowed elsewhere.

So it's really up to the application in most cases to define
what the higher-level syntactic breaks should be.  Problem
is, the email package currently has no way for applications
to tell it what to do for particular headers, so email tries
a couple of simplistic generalized splitting algorithms
(semi's then whitespace).

Wild thought: allow each header to be assigned a splitting
tokenizer method which does the "higher-level syntactic
breaks".  Tricky bits are to provide a useable API (where? 
in the Generator or in Message?), and what to do about
encoded headers vs. ascii headers.
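
One way the per-header tokenizer idea might look, as a rough sketch only.
Nothing here is part of the email package: TOKENIZERS, register_tokenizer,
quoted_string_tokens and fold are all hypothetical names, and the sample
header value is invented.  Encoded headers are ignored entirely.

```python
import re

# Hypothetical registry: lowercased header name -> tokenizer function.
TOKENIZERS = {}


def register_tokenizer(name, func):
    """Associate a tokenizer with a header name (case-insensitive)."""
    TOKENIZERS[name.lower()] = func


def quoted_string_tokens(value):
    """Treat each "-delimited string as a single unbreakable token."""
    return re.findall(r'"[^"]*"|\S+', value)


def fold(name, value, maxlen=78):
    """Fold a header, breaking only between tokens.

    Note: whitespace between tokens is normalized to one space, so this
    toy version is itself lossy; a real one would keep the original runs.
    """
    tokenize = TOKENIZERS.get(name.lower(), str.split)
    lines, current = [], name + ":"
    for token in tokenize(value):
        if len(current) + 1 + len(token) > maxlen:
            lines.append(current)
            current = "\t" + token   # continuation line starts with a TAB
        else:
            current += " " + token
    lines.append(current)
    return "\n".join(lines)


register_tokenizer("X-VM-v5-Data", quoted_string_tokens)

value = ('(nil "a string with spaces that is fairly long" '
         '"another quoted string value here" nil)')
folded = fold("X-VM-v5-Data", value, maxlen=40)
plain = fold("X-Other", value, maxlen=40)   # falls back to whitespace split
print(folded)
print(plain)
```

The registry lives outside both Generator and Message here, which dodges
the question of where such an API belongs rather than answering it.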

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2002-09-10 15:51

Message:
Logged In: YES 
user_id=44345

Hmmm...  How can RFC 2822 presume to know anything about the syntax 
of X-* headers?  Perhaps they should just be left alone...

----------------------------------------------------------------------

Comment By: Barry A. Warsaw (bwarsaw)
Date: 2002-09-10 13:29

Message:
Logged In: YES 
user_id=12800

Skip, you've got two difficult examples here.  RFC 2822
recommends splitting lines at "the highest syntactic level"
possible, but that differs depending on the semantics of the
header.  By default, Header._split_ascii() splits first on
semicolons (for multiple parameter headers) and then on
whitespace.  Your two examples exploit weaknesses in this
algorithm.

In the first case, X-VM... has the syntax of a lisp
expression.  A coarser way to look at the contents would be
to try to keep "-delimited strings without line breaks.  The
email package doesn't know anything about either of these
syntactic levels.

In the second case, you actually have X-Face data which
contains a semi-colon, so the split mentioned above does the
wrong thing in this case.
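
To illustrate why a "semicolons first, then whitespace" strategy trips over
this data, here is a toy splitter.  It is not the code in
Header._split_ascii(); naive_split is a hypothetical stand-in, and the
value is adapted from the X-Face example in this report.

```python
def naive_split(value, maxlen=40):
    """Hypothetical splitter: try semicolons first, then spaces."""
    if len(value) <= maxlen:
        return [value]
    for sep in (";", " "):
        if sep in value:
            parts = value.split(sep)
            # Re-attach the separator to every chunk except the last.
            chunks = [p + sep for p in parts[:-1]] + [parts[-1]]
            return [c.strip() for c in chunks if c.strip()]
    return [value]   # no split point found; leave the value alone


# X-Face data is opaque; the ';' in it is not a parameter separator.
xface = ("$LeJ8}Gzj%b'dmF:@bMiTrpT|UL=3O!CG~3;"
         "}dS[43`qefo('''9?B=2a0u*B4u+a)$\"DYlS")

chunks = naive_split(xface)
for chunk in chunks:
    print(chunk)
```

The splitter breaks the value right after the semicolon, in the middle of
what is really a single opaque token.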

I'm not sure what the best answer is.  We can't hardcode too
much syntactic information into the Header class.  Do we
need some kind of registration/callback mechanism so that
applications can create their own tokenization routines for
providing non-breaking tokens to the _split_ascii() method? 
Yeesh.

I'm up for suggestions.  I can add a hack so that at least
the X-VM header doesn't *lose* information when printed, but
it's just a hack, so I'm not sure what the best solution is.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2002-08-29 10:45

Message:
Logged In: YES 
user_id=44345

Hmmm...  Sometimes it seems to *add* whitespace as well.  Here's an 
example using the X-Face: header:

Before:

X-Face: 
$LeJ8}Gzj%b'dmF:@bMiTrpT|UL=3O!CG~3;}dS[43`qefo('''9?B=2a0u*B4u+a)$"DYl
 S

After:

X-Face: $LeJ8}Gzj%b'dmF:@bMiTrpT|UL=3O!CG~3;
	}dS[43`qefo('''9?B=2a0u*B4u+a)$"DYlS

----------------------------------------------------------------------
