[Email-SIG] email.header.decode_header eats my spaces

Wed Mar 28 17:45:31 CEST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Mar 27, 2007, at 8:06 PM, Tokio Kikuchi wrote:

> Well, this will surely break my contribution to Mailman 2.2's
> CookHeaders.py, which unifies the code for subject prefix munging
> for both ascii and rfc2047.  :-(
>
> Almost all MUAs do subject munging by adding 'Re:' and
> adjusting the header length.  Patching in this direction means the
> Python email package can no longer be used for, e.g., a webmail
> application.  If I understand correctly, of course.

Tokio, I'd like to understand more about why you think these two  
cases will break.  In the meantime, let me explain my understanding  
of rfc2047 and how and where I think we comply and don't comply.  If  
we get agreement on that, then we can decide what the right solution is.

So there are 4 cases we need to handle: ascii+ascii, ascii+encoded,
encoded+ascii, and encoded+encoded.  Here's what the email package
currently does in these cases (slightly out of order):

encoded+encoded:

 >>> from email.header import Header
 >>> h = Header()
 >>> h.append('hello', 'utf-8')
 >>> h.append('world', 'utf-8')
 >>> print h
=?utf-8?q?hello?= =?utf-8?q?world?=
 >>> print unicode(h)
helloworld

I think we can all agree that we do this correctly.  The rfc is  
explicitly clear that all "linear-white-space" between the two  
encoded parts must be ignored.  Clearly we could split the line on  
that linear-white-space and it would make no difference.
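
For comparison, here is a minimal sketch of the same rule seen from the
decoding side, written for Python 3 rather than the Python 2 session above.
It assumes a recent email.header module, where decode_header drops
whitespace that sits between two encoded words.  The helper name
decode_to_text is made up for illustration.

```python
from email.header import decode_header, make_header


def decode_to_text(raw):
    """Decode an RFC 2047 header value into a plain str (hypothetical helper)."""
    # decode_header() returns (bytes_or_str, charset) pairs;
    # make_header() rebuilds a Header from them, and str() decodes it.
    return str(make_header(decode_header(raw)))


# One space, several spaces, or a folded line between the two encoded
# words should all decode to the same text.
SAMPLES = (
    "=?utf-8?q?hello?= =?utf-8?q?world?=",
    "=?utf-8?q?hello?=   =?utf-8?q?world?=",
    "=?utf-8?q?hello?=\n =?utf-8?q?world?=",
)

for raw in SAMPLES:
    print(repr(decode_to_text(raw)))
```

If the assumption holds, every sample prints 'helloworld', which is the
sense in which splitting the line on that whitespace makes no difference.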

ascii+encoded:

 >>> h = Header()
 >>> h.append('hello', 'us-ascii')
 >>> h.append('world', 'utf-8')
 >>> print h
hello =?utf-8?q?world?=
 >>> print unicode(h)
hello world

Here again, I think we're doing the right thing, although IMO the rfc  
is somewhat ambiguous.  While it's clear about whitespace between  
encoded words, it is /not/ explicit about linear-white-space between  
unencoded and encoded parts.  However, if you look at the second  
example in section 8 of the rfc, this implies that linear-white-space  
is /not/ ignored when decoding and concatenating.
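
The section 8 example in question appears to be the one that encodes
(=?ISO-8859-1?Q?a?= b) and shows it displayed as (a b).  As a rough
illustration, this is how a Python 3 email.header round trip treats that
case and the header printed above.  It is a sketch on a modern interpreter,
so it may not match the 2007 code under discussion, and decode_to_text is a
hypothetical helper name.

```python
from email.header import decode_header, make_header


def decode_to_text(raw):
    """Decode an RFC 2047 header value into a plain str (hypothetical helper)."""
    return str(make_header(decode_header(raw)))


# Encoded word followed by plain text, as in the rfc's section 8 table.
print(repr(decode_to_text("=?ISO-8859-1?Q?a?= b")))

# Plain text followed by an encoded word, as in the session above.
print(repr(decode_to_text("hello =?utf-8?q?world?=")))
```

On a recent Python 3 both calls should keep the separating space, giving
'a b' and 'hello world'.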

To me, this is a flaw in the rfc because there's no way to /avoid/  
whitespace between unencoded and encoded parts!  The separating  
whitespace is required in order to comply with the parsing rules in  
the rfc, but then you're left with whitespace that is in some  
undefined way significant.  The only way to avoid that space between  
the words is to encode both parts.

But maybe that example is wrong.  Personally, I'd prefer to interpret
unicode(h) above as 'helloworld' so that the rules about
linear-white-space between unencoded and encoded parts are exactly the
same as those between two encoded parts.  I really have no way of knowing
what the intention of the rfc is here, so perhaps we need a flag on the
Header class (or in the .append() method) to specify which
interpretation the user wants.
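
To make the two interpretations concrete, here is a rough sketch of what
such a flag could select between.  Nothing here is part of the email
package: join_chunks and mixed_space are names invented for illustration,
and the chunks are assumed to be already-decoded text tagged with whether
they came from an encoded word.

```python
def join_chunks(chunks, mixed_space=True):
    """Join decoded header chunks into one string.

    chunks is a list of (text, is_encoded) pairs.  This is a hypothetical
    helper, not an email package API.

    mixed_space=True:  keep one space at every boundary between an
                       unencoded and an encoded chunk.
    mixed_space=False: treat that boundary like encoded+encoded and
                       insert nothing.
    """
    out = []
    prev_encoded = None
    for text, is_encoded in chunks:
        # Only a change between encoded and unencoded can produce a space.
        if out and mixed_space and is_encoded != prev_encoded:
            out.append(" ")
        out.append(text)
        prev_encoded = is_encoded
    return "".join(out)


mixed = [("hello", False), ("world", True)]
print(join_chunks(mixed))                     # hello world
print(join_chunks(mixed, mixed_space=False))  # helloworld
```

Either way, two adjacent encoded chunks are joined with nothing between
them; the flag only changes the mixed cases.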

If the separating space is treated the same in this case, then our  
folding rules can be exactly the same.  Otherwise things get more  
complicated because we probably ought to be preserving the whitespace  
for when we unfold (more on that in a separate followup).

encoded+ascii:

 >>> h = Header()
 >>> h.append('hello', 'utf-8')
 >>> h.append('world', 'us-ascii')
 >>> print h
=?utf-8?q?hello?= world
 >>> print unicode(h)
hello world

More of the same.

ascii+ascii:

 >>> h = Header()
 >>> h.append('hello', 'us-ascii')
 >>> h.append('world', 'us-ascii')
 >>> print h
hello world
 >>> print unicode(h)
helloworld

I think we're nearly correct here.  The unicode version is what I'd  
expect, but the string version is not.  I think in both cases we  
should print 'helloworld'.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (Darwin)

iQCVAwUBRgqNnHEjvBPtnXfVAQIFmQP+J4ud9R/hvBupIcUpZNWFntzdcVPPHGPq
vTNycMm+9pvaU7KFbIU2LabnQGUGZ+yycFGl8WTTtIddad6DGPBGfeGX2jSOk4XB
MpakU5JBO1/uP5zB1wC13yzZlTXVBqyKntNr8Z1VsAHUtzC9EIJhp3xlbUEyqWgW
WuhUS4wcMgI=
=nffu
-----END PGP SIGNATURE-----