[Email-SIG] email.header.decode_header eats my spaces

Thu Mar 29 02:13:23 CEST 2007

Barry Warsaw wrote:

> ascii+encoded
> 
>  >>> h = Header()
>  >>> h.append('hello', 'us-ascii')
>  >>> h.append('world', 'utf-8')
>  >>> print h
> hello =?utf-8?q?world?=
>  >>> print unicode(h)
> hello world
> 
> Here again, I think we're doing the right thing, although IMO the rfc is 
> somewhat ambiguous.  While it's clear about whitespace between encoded 
> words, it is /not/ explicit about linear-white-space between unencoded 
> and encoded parts.  However, if you look at the second example in 
> section 8 of the rfc, this implies that linear-white-space is /not/ 
> ignored when decoding and concatenating.
> 
> To me, this is a flaw in the rfc because there's no way to /avoid/ 
> whitespace between unencoded and encoded parts!

Well, it looks to me as if RFC2047 prohibits this, at least in header 
text.  An example for comment text in section 8 states:

    (=?ISO-8859-1?Q?a?= b)                      (a b)

            Within a 'comment', white space MUST appear between an
            'encoded-word' and surrounding text.  [Section 5,
            paragraph (2)].  However, white space is not needed between
            the initial "(" that begins the 'comment', and the
            'encoded-word'.

The word MUST means there is no way to omit the space between an 
encoded-word and the surrounding ASCII text.  The '(' before the 
encoded-word appears to violate this, but it is a higher-level syntax 
token.

The current email.header violates this example because we have no class 
that recognizes a comment in a structured header.

 >>> from email.header import *
 >>> s = '(=?ISO-8859-1?Q?a?= b)'
 >>> l = decode_header(s)
 >>> l
[('(', None), ('a', 'iso-8859-1'), ('b)', None)]
 >>> h = make_header(l)
 >>> print h
( =?iso-8859-1?q?a?= b)
  ^ notice this extra space.

The current behavior is correct if the '(' is in a *text field, in which 
case the example does not apply.  The problem with the email.header 
module is that it cannot distinguish between structured and unstructured 
(text-only) headers.  The Header class could have a member function like 
'add_comment', IMHO.
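
To make the 'add_comment' idea concrete, here is a minimal sketch, 
written for Python 3 rather than the Python 2 sessions quoted in this 
thread.  The helper name make_comment and its list-of-(text, charset) 
argument are hypothetical; nothing like it exists in email.header.  It 
also assumes the text contains no parentheses or backslashes, which a 
real implementation would have to escape.

```python
from email.header import Header


def make_comment(words):
    """Build a structured-header comment from (text, charset) pairs.

    Hypothetical helper, not part of email.header.  A charset of None
    means the text is plain ASCII and is emitted unchanged.
    """
    tokens = []
    for text, charset in words:
        if charset is None:
            tokens.append(text)
        else:
            # Header.encode() turns the text into an RFC 2047 encoded-word.
            tokens.append(Header(text, charset).encode())
    # Words are separated by one space, but no space is added after '('
    # or before ')', which the section 8 example permits.
    return '(' + ' '.join(tokens) + ')'


comment = make_comment([('a', 'iso-8859-1'), ('b', None)])
print(comment)
```

The point of the sketch is only that the parentheses are added by code 
that knows it is building a comment, so the space rule applies between 
words and not next to the delimiters.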

> But maybe that example is wrong.  Personally, I'd prefer to interpret 
> unicode(h) above as 'helloworld' so that the rules about 
> linear-white-space between unencoded and encoded parts is exactly the 
> same as for between two encoded parts.  I really have no way of knowing 
> what the intention of the rfc is here, so perhaps we need a flag on the 
> Header class (or in the .append() method) to specify which 
> interpretation the user wants.

RFC2047 is clear in that an 'encoded-word' should be treated as a plain 
English word, separated from its neighbors by a space (or by a higher 
syntactic token like '(', ')', ';' etc.).  The only exception is an 
'encoded-word'--'encoded-word' sequence, which may result from wrapping 
a long line, because text tends to become longer when encoded.
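
The rule can be sketched in a few lines of Python 3.  The function 
display_text below is a hypothetical helper for illustration only, not 
how email.header is implemented; it uses decode_header solely to decode 
one encoded-word at a time, and its regular expression is a simplified 
pattern rather than the full RFC2047 grammar.

```python
import re
from email.header import decode_header

# One RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r'=\?[^?\s]+\?[qQbB]\?[^?\s]*\?=')


def display_text(value):
    """Decode encoded-words, dropping whitespace only between two of them.

    Hypothetical helper for illustration; not part of email.header.
    """
    out = []
    pos = 0
    seen_encoded = False
    for match in ENCODED_WORD.finditer(value):
        gap = value[pos:match.start()]
        # Whitespace separating two adjacent encoded-words is ignored;
        # every other gap (plain text, '(', spaces next to plain text)
        # is kept as written.
        if not (seen_encoded and gap.isspace()):
            out.append(gap)
        raw, charset = decode_header(match.group(0))[0]
        out.append(raw.decode(charset))
        seen_encoded = True
        pos = match.end()
    out.append(value[pos:])
    return ''.join(out)


print(display_text('=?utf-8?q?hello?= =?utf-8?q?world?='))  # helloworld
print(display_text('hello =?utf-8?q?world?='))               # hello world
print(display_text('(=?ISO-8859-1?Q?a?= b)'))                # (a b)
```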

>  >>> h = Header()
>  >>> h.append('hello', 'us-ascii')
>  >>> h.append('world', 'us-ascii')
>  >>> print h
> hello world
>  >>> print unicode(h)
> helloworld
> 
> I think we're nearly correct here.  The unicode version is what I'd 
> expect, but the string version is not.  I think in both cases we should 
> print 'helloworld'.

No.  The email.header module is not a word processor.  Because RFC2047 
deals with 'word's, we should treat these parts as 'word's for 
consistency.  It is the unicode() output that should be fixed.  If these 
words are to be concatenated without a space, that should be done 
outside the header module.
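
A minimal sketch of doing the concatenation outside the header module, 
written for Python 3 (where str() takes the place of the unicode() call 
in the session above).  The caller joins the pieces itself and appends 
the result as a single word, so email.header never has to decide whether 
a space belongs between them.

```python
from email.header import Header

# Join the pieces first, then hand the result to Header as one word.
h = Header()
h.append('hello' + 'world', 'us-ascii')

joined = str(h)
print(joined)
```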

Remember, we are not making an almighty word processor but an 
RFC-compliant module.

Cheers,
-- 
Tokio Kikuchi, tkikuchi at is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/