[Email-SIG] email.header.decode_header eats my spaces

Barry Warsaw barry at python.org
Thu Mar 29 06:24:42 CEST 2007


On Mar 28, 2007, at 8:13 PM, Tokio Kikuchi wrote:

> Well, it looks to me that RFC2047 prohibits this at least in header  
> text.  An example for comment text in section 8 states:
>
>    (=?ISO-8859-1?Q?a?= b)                      (a b)
>
>            Within a 'comment', white space MUST appear between an
>            'encoded-word' and surrounding text.  [Section 5,
>            paragraph (2)].  However, white space is not needed between
>            the initial "(" that begins the 'comment', and the
>            'encoded-word'.
>
> The word MUST means that spaces cannot be omitted between an
> encoded-word and surrounding ASCII text.  The '(' before the
> encoded-word appears to violate this, but it is a higher-level syntax
> token.
>
> The current email.header violates this example because we have no
> class that recognizes a comment in a structured header.

Thanks Tokio, I agree with all of this.  I think you're right in  
identifying that the problem here is that we don't really have any  
way to understand the semantics of a particular header's body.
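
To make the quoted rule concrete, here is a minimal sketch of a checker
for encoded-words inside a comment.  The function name
misplaced_encoded_words and the simplified regular expression are
hypothetical, made up for illustration; nothing like this exists in
email.header.  It treats "(" and ")" as acceptable neighbours, based on
the RFC 2047 section 8 examples, and flags any encoded-word that touches
other text.

```python
import re

# Simplified encoded-word pattern: =?charset?encoding?encoded-text?=
# (an illustrative approximation, not the full RFC 2047 grammar)
ENCODED_WORD = re.compile(r'=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=')


def misplaced_encoded_words(comment):
    """Return encoded-words in `comment` that touch surrounding text.

    `comment` is expected to include its parentheses, for example
    '(=?ISO-8859-1?Q?a?= b)'.  Hypothetical helper for illustration.
    """
    bad = []
    for match in ENCODED_WORD.finditer(comment):
        # Treat the start or end of the string like white space.
        before = comment[match.start() - 1] if match.start() > 0 else ' '
        after = comment[match.end()] if match.end() < len(comment) else ' '
        ok_before = before.isspace() or before == '('
        ok_after = after.isspace() or after == ')'
        if not (ok_before and ok_after):
            bad.append(match.group())
    return bad


print(misplaced_encoded_words('(=?ISO-8859-1?Q?a?= b)'))
print(misplaced_encoded_words('(=?ISO-8859-1?Q?a?=b)'))
```

The first call reports nothing, because the encoded-word is preceded by
"(" and followed by a space.  The second reports the encoded-word,
because "b" follows it directly.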

> The current behavior is correct if '(' is in a *text field, in which
> case the example is not applicable.  The problem in the email.header
> module is that it cannot distinguish between structured and
> unstructured (text only) headers.  The Header class could have a
> member function like 'add_comment', IMHO.

I think we might want to try to address this in a more general and  
extensible way, so that we can support future semantically meaningful  
headers.

>>  >>> h = Header()
>>  >>> h.append('hello', 'us-ascii')
>>  >>> h.append('world', 'us-ascii')
>>  >>> print h
>> hello world
>>  >>> print unicode(h)
>> helloworld
>> I think we're nearly correct here.  The unicode version is what  
>> I'd expect, but the string version is not.  I think in both cases  
>> we should print 'helloworld'.
>
> No.  The email.header module is not a word processor.  Because
> RFC2047 deals with 'word's, we should treat these parts as 'word's
> for consistency.  The unicode() function should be fixed.  If these
> words are to be concatenated without a space, it should be done
> outside the header module.

Right, but these parts aren't being encoded, and yet we've still  
stuck a space between the parts that didn't exist there before.  I'd  
feel better about it if we encoded these chunks too.
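
As a rough illustration of the "do it outside the header module" option,
the sketch below compares two appends with a single append of
already-joined text.  It assumes a Python 3 interpreter, where str(h)
plays the role that unicode(h) plays in the session quoted above.  The
variable names are made up for the example.  The output of the two-chunk
case is printed rather than asserted, since how chunks are joined is
exactly what is under discussion and may differ between versions.

```python
from email.header import Header

# Two appends: the Header class decides how the chunks are joined.
two_chunks = Header()
two_chunks.append('hello', 'us-ascii')
two_chunks.append('world', 'us-ascii')

# One append: the caller has already joined the words, so there is no
# chunk boundary at which a space could be introduced.
one_chunk = Header()
one_chunk.append('hello' + 'world', 'us-ascii')

print(repr(str(two_chunks)))
print(repr(str(one_chunk)))
```

With a single chunk the result is 'helloworld' whatever joining rule
Header applies between chunks, so the caller keeps control of the
spacing.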

-Barry
