[Python-Dev] [Email-SIG] headers api for email package

Mon Apr 13 19:15:20 CEST 2009

Barry Warsaw writes:
 > On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:
 > 
 > > Barry Warsaw wrote:
 > >> >>> message['Subject']
 > >> The raw bytes or the decoded unicode?
 > >
 > > A header object.
 > 
 > Yep.  You got there before I did. :)
 > 
 > >> Okay, so you've picked one.  Now how do you spell the other way?
 > >
 > > str(message['Subject'])
 > 
 > Yes for unstructured headers like Subject.  For structured headers...  
 > hmm.

Well, suppose we get really radical here.  *People* see email as
(rich-)text.  So ... message['Subject'] returns an object, partly to
be consistent with more complex headers' APIs, but partly to remind us
that nothing in email is as simple as it seems.  Now,
str(message['Subject']) is really for presentation to the user, right?
OK, so let's make it a presentation function!  Decode the MIME-words,
optionally unfold folded lines, optionally compress spaces, etc.  This
by default returns the subject field as a single, possibly quite long,
line.  Then a higher-level API can rewrap it, add fonts etc, for fancy
presentation.  This also suggests that we don't want the field tag (ie,
"Subject") to be part of this value.

Of course a *really* smart higher-level API would access structured
headers based on their structure, not on the one-size-fits-all str()
conversion.
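
A minimal sketch of the str()-as-presentation idea for an unstructured
header, assuming the raw value is already a str.  The helper name
present() and its unfold/compress flags are hypothetical; only
decode_header() and make_header() come from the existing email.header
module.

```python
import re
from email.header import decode_header, make_header


def present(raw_value, unfold=True, compress=True):
    """Return an unstructured header value as one line of display text.

    Hypothetical helper, not part of the email package.
    """
    value = raw_value
    if unfold:
        # Unfold: drop each line break that is followed by a space or tab.
        value = re.sub(r"\r?\n(?=[ \t])", "", value)
    # Decode the MIME-words (RFC 2047 encoded-words) into unicode text.
    value = str(make_header(decode_header(value)))
    if compress:
        # Collapse runs of spaces and tabs into a single space.
        value = re.sub(r"[ \t]+", " ", value).strip()
    return value


if __name__ == "__main__":
    print(present("=?utf-8?q?Caf=C3=A9?= menu\r\n for   today"))
```

The field tag is deliberately not part of the input or the result; a
higher-level layer would rewrap or style the returned line.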

Then MTAs see email as a string of octets.  So guess what:

 > > bytes(message['Subject'])

gives wire format.  Yow!  I think I'm just joking.  Right?
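
Taken at face value, the str()/bytes() split could look like the
sketch below.  SubjectHeader is a hypothetical class, not something the
email package provides; the only real API used is email.header.Header,
whose encode() method produces RFC 2047 encoded-words for non-ASCII
text.

```python
from email.header import Header


class SubjectHeader:
    """Hypothetical unstructured-header object (illustration only)."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # What people see: the decoded text, without the field tag.
        return self.text

    def __bytes__(self):
        # What MTAs see: ASCII-only octets.  Header.encode() returns a str
        # containing only ASCII, so encoding it as ASCII cannot fail.
        encoded = Header(self.text, header_name="Subject").encode()
        return encoded.encode("ascii")


if __name__ == "__main__":
    subject = SubjectHeader("Caf\u00e9 menu")
    print(str(subject))
    print(bytes(subject))
```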

 > >> Now, setting headers.  Sometimes you have some unicode thing and  
 > >> sometimes you have some bytes.  You need to end up with bytes in  
 > >> the ASCII range and you'd like to leave the header value unencoded  
 > >> if so.  But in both cases, you might have bytes or characters  
 > >> outside that range, so you need an explicit encoding, defaulting to  
 > >> utf-8 probably.
 > >> >>> Message.set_header('Subject', 'Some text', encoding='utf-8')
 > >> >>> Message.set_header('Subject', b'Some bytes')
 > >
 > > Where you just want "a damned valid email and stop making my life  
 > > hard!":

-1  I mean, yeah, Brother, I feel your pain but it just isn't that
easy.  If that were feasible, it would be *criminal* to have a
.set_header() method at all!  In fact,

 > > Message['Subject']='Some text'

is going to (a) need to take *only* unicodes, or (b) raise Exceptions
at the slightest provocation when handed bytes.

And things only get worse if you try to provide this interface for say
"From" (let alone "Content-Type").  Is it really worth doing the
mapping interface if it's only usable with free-form headers (ie, only
Subject among the commonly used headers)?

 > Yes.  In which case I propose we guess the encoding as 1) ascii, 2)  
 > utf-8, 3) wtf?

Uh, what guessing?  If you don't know what you have but you believe it
to be a valid header field, then presumably you got it off the wire
and it's still in bytes and you just spit it out on the wire without
trying to decode or encode it.  But as I already said, I think that's
a bad idea.  Otherwise, you should have a unicode, and you simply look
at the range of the string.  If it fits in ASCII, Bob's your uncle.
If not, Bob's your aunt (and you use UTF-8).
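
The "look at the range of the string" rule can be sketched as follows.
pick_charset() and encode_unstructured() are hypothetical names used
for illustration; Header and its encode() method are the existing
email.header API.

```python
from email.header import Header


def pick_charset(text):
    """Return 'us-ascii' if text fits in ASCII, else 'utf-8'."""
    if not isinstance(text, str):
        # No guessing: bytes must be decoded by the caller first.
        raise TypeError("expected str, got %s" % type(text).__name__)
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return "utf-8"
    return "us-ascii"


def encode_unstructured(text):
    """Encode a free-form header value for the wire, as an ASCII-only str."""
    return Header(text, charset=pick_charset(text)).encode()


if __name__ == "__main__":
    print(encode_unstructured("Some text"))
    print(encode_unstructured("Gr\u00fc\u00dfe"))
```

An ASCII value passes through unencoded; anything else becomes UTF-8
encoded-words.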

 > > Where you care about what encoding is used:
 > >
 > > Message['Subject']=Header('Some text',encoding='utf-8')
 > 
 > Yes.
 > 
 > > If you have bytes, for whatever reason:
 > >
 > > Message['Subject']=b'some bytes'.decode('utf-8')
 > >
 > > ...because only you know what encoding those bytes use!
 > 
 > So you're saying that __setitem__() should not accept raw bytes?

How do you distinguish "raw" bytes from "encoded bytes"?
__setitem__() shouldn't accept bytes at all.  There should be an API
which sets a .formatted_for_the_wire member, and it should have a
"validate" option (ie, when true the API attempts to parse the header
and raises an exception if it fails to do so; when false, it assumes
you know what you're doing and will send out the bytes verbatim).
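
A rough sketch of that split, under stated assumptions: the Message and
WireHeader classes and the set_wire_header() method are hypothetical,
and the "parse" step is a deliberately small stand-in (ASCII only, line
breaks allowed only as folds) rather than a real header parser.

```python
import re


class WireHeader:
    """Hypothetical header whose octets are sent out verbatim."""

    def __init__(self, raw, validate=True):
        if not isinstance(raw, bytes):
            raise TypeError("wire format must be bytes")
        if validate:
            self._check(raw)
        self.formatted_for_the_wire = raw

    @staticmethod
    def _check(raw):
        # Stand-in validation: the octets must be ASCII, and every line
        # break must be a fold (CRLF followed by a space or tab).
        try:
            text = raw.decode("ascii")
        except UnicodeDecodeError as exc:
            raise ValueError("non-ASCII octets in header") from exc
        unfolded = re.sub(r"\r\n(?=[ \t])", "", text)
        if "\r" in unfolded or "\n" in unfolded:
            raise ValueError("line break that is not a fold")


class Message:
    """Hypothetical message with a text-only mapping interface."""

    def __init__(self):
        self._headers = {}

    def __setitem__(self, name, value):
        # The mapping interface accepts text only; bytes are refused.
        if isinstance(value, (bytes, bytearray)):
            raise TypeError("decode the bytes first, or use set_wire_header()")
        self._headers[name] = value

    def __getitem__(self, name):
        return self._headers[name]

    def set_wire_header(self, name, raw, validate=True):
        # validate=False: assume the caller knows what they are doing.
        self._headers[name] = WireHeader(raw, validate=validate)
```

With validate=False the bytes are stored untouched, so a caller who
insists can still put a broken header on the wire; the default refuses
it.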