[Email-SIG] email.header.decode_header eats my spaces
Barry Warsaw
barry at python.org
Wed Mar 28 17:45:31 CEST 2007
On Mar 27, 2007, at 8:06 PM, Tokio Kikuchi wrote:
> Well, this will surely break my contribution on Mailman 2.2
> CookHeaders.py where unifying the code for subject prefix munging
> for both ascii and rfc2047. :-(
>
> Almost all the MUAs do subject munging by adding 'Re:' and
> adjusting the header length. This direction of patching means
> Python email package can't no more be used for eg. webmail
> application. If I understand correctly of course.
Tokio, I'd like to understand more about why you think these two
cases will break. In the meantime, let me explain my understanding
of rfc2047 and how and where I think we comply and don't comply. If
we get agreement on that, then we can decide what the right solution is.
So there are 4 cases we need to handle: ascii+ascii, ascii+encoded,
encoded+ascii, encoded+encoded. Here's what the email package
currently does in these cases (slightly out of order):
encoded+encoded:
>>> h = Header()
>>> h.append('hello', 'utf-8')
>>> h.append('world', 'utf-8')
>>> print h
=?utf-8?q?hello?= =?utf-8?q?world?=
>>> print unicode(h)
helloworld
I think we can all agree that we do this correctly. The rfc is
explicitly clear that all "linear-white-space" between the two
encoded parts must be ignored. Clearly we could split the line on
that linear-white-space and it would make no difference.
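As a rough check of the decoding side of this rule, here is a minimal sketch using decode_header() and make_header() from a current Python 3 email package (an assumption; the sessions in this message are Python 2, and Header.append() behavior may differ between versions). The helper name decode_to_text is hypothetical.

```python
from email.header import decode_header, make_header


def decode_to_text(raw):
    """Decode an RFC 2047 header value to a plain str (hypothetical helper)."""
    return str(make_header(decode_header(raw)))


# Two encoded words separated by a single space on one line.
same_line = "=?utf-8?q?hello?= =?utf-8?q?world?="

# The same two encoded words, folded onto a continuation line.
folded = "=?utf-8?q?hello?=\n =?utf-8?q?world?="

# In both cases the whitespace between the encoded words is dropped.
print(decode_to_text(same_line))  # helloworld
print(decode_to_text(folded))     # helloworld
```

Folding at the separating whitespace does not change the decoded text, which is the point made above.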
ascii+encoded:
>>> h = Header()
>>> h.append('hello', 'us-ascii')
>>> h.append('world', 'utf-8')
>>> print h
hello =?utf-8?q?world?=
>>> print unicode(h)
hello world
Here again, I think we're doing the right thing, although IMO the rfc
is somewhat ambiguous. While it's clear about whitespace between
encoded words, it is /not/ explicit about linear-white-space between
unencoded and encoded parts. However, if you look at the second
example in section 8 of the rfc, this implies that linear-white-space
is /not/ ignored when decoding and concatenating.
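For reference, the second example in section 8 of the rfc is the encoded form "=?ISO-8859-1?Q?a?= b", displayed as "a b". The sketch below decodes that string and the mixed header from the session above, assuming a current Python 3 email package rather than the Python 2 one used in this message. The helper name decode_to_text is hypothetical.

```python
from email.header import decode_header, make_header


def decode_to_text(raw):
    """Decode an RFC 2047 header value to a plain str (hypothetical helper)."""
    return str(make_header(decode_header(raw)))


# Second example from RFC 2047 section 8: encoded word, space, plain text.
rfc_example = "=?ISO-8859-1?Q?a?= b"

# The ascii+encoded header produced in the session above.
mixed = "hello =?utf-8?q?world?="

# The whitespace at the unencoded/encoded boundary survives decoding.
print(decode_to_text(rfc_example))  # a b
print(decode_to_text(mixed))        # hello world
```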
To me, this is a flaw in the rfc because there's no way to /avoid/
whitespace between unencoded and encoded parts! The separating
whitespace is required in order to comply with the parsing rules in
the rfc, but then you're left with whitespace that is in some
undefined way significant. The only way to avoid that space between
the words is to encode both parts.
But maybe that example is wrong. Personally, I'd prefer to interpret
unicode(h) above as 'helloworld' so that the rules about linear-white-space
between unencoded and encoded parts are exactly the same as those
between two encoded parts. I really have no way of knowing what the
intention of the rfc is here, so perhaps we need a flag on the Header
class (or in the .append() method) to specify which interpretation
the user wants.
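To make the two interpretations concrete, here is a sketch of what such a switch could look like on the decoding side. This is purely illustrative: the function decode_with_policy and its keep_boundary_space parameter are hypothetical and are not part of the email package; only decode_header() is a real call, and the sketch assumes its current Python 3 behavior.

```python
from email.header import decode_header


def decode_with_policy(raw, keep_boundary_space=True):
    """Decode a header value, optionally dropping the whitespace that
    separates unencoded text from a neighboring encoded word.

    Hypothetical helper; the flag name is an assumption for illustration.
    """
    parts = decode_header(raw)
    pieces = []
    for i, (data, charset) in enumerate(parts):
        # decode_header() yields bytes for mixed headers, but a plain str
        # when the value contains no encoded words at all.
        if isinstance(data, bytes):
            text = data.decode(charset or "us-ascii")
        else:
            text = data
        if charset is None and not keep_boundary_space:
            if i > 0:
                text = text.lstrip()   # boundary with the preceding part
            if i < len(parts) - 1:
                text = text.rstrip()   # boundary with the following part
        pieces.append(text)
    return "".join(pieces)


print(decode_with_policy("hello =?utf-8?q?world?="))         # hello world
print(decode_with_policy("hello =?utf-8?q?world?=", False))  # helloworld
```

With the flag off, the boundary whitespace is treated like the whitespace between two encoded words; a header with no encoded words is returned unchanged either way.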
If the separating space is treated the same in this case, then our
folding rules can be exactly the same. Otherwise things get more
complicated because we probably ought to be preserving the whitespace
for when we unfold (more on that in a separate followup).
encoded+ascii:
>>> h = Header()
>>> h.append('hello', 'utf-8')
>>> h.append('world', 'us-ascii')
>>> print h
=?utf-8?q?hello?= world
>>> print unicode(h)
hello world
More of the same.
ascii+ascii:
>>> h = Header()
>>> h.append('hello', 'us-ascii')
>>> h.append('world', 'us-ascii')
>>> print h
hello world
>>> print unicode(h)
helloworld
I think we're nearly correct here. The unicode version is what I'd
expect, but the string version is not. I think in both cases we
should print 'helloworld'.
-Barry