Barry Warsaw writes:
On Apr 11, 2009, at 8:39 AM, Chris Withers wrote:
Barry Warsaw wrote:
message['Subject'] The raw bytes or the decoded unicode?
A header object.
Yep. You got there before I did. :)
Okay, so you've picked one. Now how do you spell the other way?
Yes for unstructured headers like Subject. For structured headers...
Well, suppose we get really radical here. People see email as (rich-)text. So ... message['Subject'] returns an object, partly to be consistent with more complex headers' APIs, but partly to remind us that nothing in email is as simple as it seems. Now, str(message['Subject']) is really for presentation to the user, right? OK, so let's make it a presentation function! Decode the MIME-words, optionally unfold folded lines, optionally compress spaces, etc. This by default returns the subject field as a single, possibly quite long, line. Then a higher-level API can rewrap it, add fonts etc, for fancy presentation. This also suggests that we don't want the field tag (ie, "Subject") to be part of this value.
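A minimal sketch of such a presentation function, built on the existing email.header helpers. The name present_header and its two keyword options are made up for illustration; they are not a proposed API.

```python
import re
from email.header import decode_header, make_header


def present_header(raw_value, unfold=True, compress_spaces=True):
    """Return a header value as one line of text meant for display.

    Hypothetical helper: the name and both options are illustrative only.
    """
    text = raw_value
    if unfold:
        # A folded header continues on the next line after a line break
        # followed by whitespace; drop the break, keep the whitespace.
        text = re.sub(r"\r?\n(?=[ \t])", "", text)
    # Decode any RFC 2047 encoded words ("MIME-words") into unicode.
    text = str(make_header(decode_header(text)))
    if compress_spaces:
        # Collapse runs of spaces and tabs into a single space.
        text = re.sub(r"[ \t]+", " ", text).strip()
    return text
```

Something like present_header("=?utf-8?q?caf=C3=A9?=") would then hand back the decoded text, and a higher-level layer could rewrap or style it as it likes.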
Of course a really smart higher-level API would access structured headers based on their structure, not on the one-size-fits-all str() conversion.
Then MTAs see email as a string of octets. So guess what:
gives wire format. Yow! I think I'm just joking. Right?
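To make the two views concrete, here is a sketch that assumes (this is my assumption, not something settled in the thread) that the wire form is exposed through __bytes__ on a hypothetical SubjectHeader class, with str() kept for presentation.

```python
from email.header import Header


class SubjectHeader:
    """Hypothetical header object; not part of the email package."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        # Presentation view: the decoded unicode text.
        return self._text

    def __bytes__(self):
        # Wire view (assumed spelling): RFC 2047 encode if needed,
        # then emit pure ASCII octets.
        return Header(self._text).encode().encode("ascii")


subject = SubjectHeader("caf\u00e9")
shown_to_user = str(subject)    # unicode, for people
sent_to_mta = bytes(subject)    # ASCII octets, for MTAs
```

Note this returns only the field value, without the "Subject:" tag.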
Now, setting headers. Sometimes you have some unicode thing and sometimes you have some bytes. You need to end up with bytes in the ASCII range and you'd like to leave the header value unencoded if so. But in both cases, you might have bytes or characters outside that range, so you need an explicit encoding, defaulting to
Message.set_header('Subject', 'Some text', encoding='utf-8')
Message.set_header('Subject', b'Some bytes')
Where you just want "a damned valid email and stop making my
-1 I mean, yeah, Brother, I feel your pain but it just isn't that easy. If that were feasible, it would be criminal to have a .set_header() method at all! In fact,
is going to (a) need to take only unicodes, or (b) raise Exceptions at the slightest provocation when handed bytes.
And things only get worse if you try to provide this interface for say "From" (let alone "Content-Type"). Is it really worth doing the mapping interface if it's only usable with free-form headers (ie, only Subject among the commonly used headers)?
Yes. In which case I propose we guess the encoding as 1) ascii, 2) utf-8, 3) wtf?
Uh, what guessing? If you don't know what you have but you believe it to be a valid header field, then presumably you got it off the wire and it's still in bytes and you just spit it out on the wire without trying to decode or encode it. But as I already said, I think that's a bad idea. Otherwise, you should have a unicode, and you simply look at the range of the string. If it fits in ASCII, Bob's your uncle. If not, Bob's your aunt (and you use UTF-8).
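The "look at the range of the string" rule is small enough to sketch. The function names choose_charset and encode_header_value are hypothetical; only email.header.Header is real.

```python
from email.header import Header


def choose_charset(text):
    """Pick us-ascii if the unicode string fits in ASCII, else utf-8."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return "utf-8"
    return "us-ascii"


def encode_header_value(text):
    """Return the header value as an ASCII-only str, ready for the wire.

    Pure-ASCII text is left unencoded; anything else becomes an
    RFC 2047 encoded word in UTF-8.
    """
    return Header(text, choose_charset(text)).encode()
```

No guessing is involved: the input is always unicode, so the only question is whether it needs encoding at all.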
Where you care about what encoding is used:
If you have bytes, for whatever reason:
...because only you know what encoding those bytes use!
So you're saying that __setitem__() should not accept raw bytes?
How do you distinguish "raw" bytes from "encoded bytes"? __setitem__() shouldn't accept bytes at all. There should be an API which sets a .formatted_for_the_wire member, and it should have a "validate" option (ie, when true the API attempts to parse the header and raises an exception if it fails to do so; when false, it assumes you know what you're doing and will send out the bytes verbatim).
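Roughly what I have in mind, as a sketch only. The class name WireHeader and the method name set_wire_bytes are invented here, and the "parse" is a deliberately crude check (a field name, a colon, ASCII octets only) standing in for a real header parser.

```python
import re

# Field names are printable ASCII except the colon (RFC 5322 "ftext").
_FIELD_NAME = re.compile(rb"[!-9;-~]+")


class WireHeader:
    """Hypothetical holder for a header that is already in wire format."""

    def __init__(self):
        self.formatted_for_the_wire = None

    def set_wire_bytes(self, raw, validate=True):
        if not isinstance(raw, bytes):
            raise TypeError("wire format must be bytes")
        if validate:
            # Stand-in for a real parse: require "Name:" and ASCII octets.
            name, sep, _value = raw.partition(b":")
            if not sep or not _FIELD_NAME.fullmatch(name):
                raise ValueError("not a 'Name: value' header field")
            if any(octet > 127 for octet in raw):
                raise ValueError("non-ASCII octets in header field")
        # With validate=False the caller is trusted and the bytes are
        # stored verbatim.
        self.formatted_for_the_wire = raw
```

With validate=True a malformed field raises immediately; with validate=False whatever you pass goes out on the wire untouched.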