At 22:26 -0400 04/09/2009, Barry Warsaw wrote:
There are really two ways to look at an email message. It's either an unstructured blob of bytes, or it's a structured tree of objects. Those objects have headers and payload. The payload can be of any type, though I think it generally breaks down into "strings" for text/
The email package isn't a perfect mapping to this, which is something I want to improve. That aside, I think storing a message in a database means storing some or all of the headers separately from the byte stream (or text?) of its payload. That's for non-multipart types. It would be more complicated to represent a message tree of course.
Storing an email message in a database does mean storing some of the header fields as database fields, but the set of email header fields is open, so any "unused" fields in a message must be stored elsewhere. It isn't useful to just have a bag of name/value pairs in a table. General message MIME payload trees don't map well to a database either, unless one wants to get very relational. Sometimes the database needs to represent the entire email message, header fields and MIME tree, but only if it is an email program and usually not even then. Usually, the database has a specific purpose, and can be designed for the data it cares about; it may choose to keep the original message as bytes.
It does seem to make sense to think about headers as text header names and text header values. Of course, header values can contain almost anything and there's an encoding to bring it back to 7-bit ASCII, but again, you really have two views of a header value. Which you want really depends on your application.
I think of header fields as having text-like names (the set of allowed characters is more than just text, though defined headers don't make use of that), but the data is either bytes or it should be parsed into something appropriate: text for unstructured fields like Subject:, a list of addresses for address fields like To:. Many of the structured header fields have a reasonable mapping to text; certainly this is true for address header fields. Content-Type header fields are barely text; they can be so convolutedly structured, but I suppose one could flatten one of them to text instead of bytes if the user wanted. It's not very useful, though, except for debugging (either by the programmer or the recipient who wants to know what was cleaned from the message).
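To make the "parse into something appropriate" idea concrete, here is a minimal sketch using the standard library email package as it exists in current Python 3. The helper names subject_text and recipient_list and the sample message are made up for illustration; this is one possible reading of the idea, not a proposed API.

```python
from email import message_from_bytes
from email.header import decode_header, make_header
from email.utils import getaddresses

# Hypothetical sample message; the addresses are placeholders.
RAW = (
    b"From: carol@example.com\r\n"
    b"To: Alice <alice@example.com>, bob@example.com\r\n"
    b"Subject: =?iso-8859-1?q?Caf=E9_menu?=\r\n"
    b"\r\n"
    b"body\r\n"
)


def _decode(value):
    """Decode any RFC 2047 encoded words in a header value to text."""
    return str(make_header(decode_header(value)))


def subject_text(msg):
    """Unstructured field: return Subject: as decoded text."""
    return _decode(msg.get("Subject", ""))


def recipient_list(msg, field="To"):
    """Address field: return a list of (display name, address) pairs."""
    pairs = getaddresses(msg.get_all(field, []))
    return [(_decode(name), addr) for name, addr in pairs]


msg = message_from_bytes(RAW)
print(subject_text(msg))
print(recipient_list(msg))
```

The original bytes in RAW stay untouched, so an application that cares about the encoded form can still keep them alongside the parsed values.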
Maybe you just care about the text of both the header name and value. In that case, I think you want the values as unicodes, and probably the headers as unicodes containing only ASCII. So your table would be strings in both cases. OTOH, maybe your application cares about the raw underlying encoded data, in which case the header names are probably still strings of ASCII-ish unicodes and the values are bytes. It's this distinction (and I think the competing use cases) that make a true Python 3.x API for email more complicated.
If a database stores the Subject: header field, it would be as text. The various recipient address fields are a one-to-many mapping from a message to names and addresses, and need a related table of name/address fields, with each field being text. The original message (or whatever part of it one preserves) should be bytes. I don't think this complicates the email package API; rather, it just shows where generality is needed.
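As a rough sketch of that layout, assuming SQLite and today's standard library: the table names, column names, and the store_message helper are all hypothetical, chosen only to show Subject: as text, recipients in a related table, and the original message as bytes.

```python
import sqlite3
from email import message_from_bytes
from email.header import decode_header, make_header
from email.utils import getaddresses

# Hypothetical schema: one row per message, one row per recipient.
SCHEMA = """
CREATE TABLE message (
    id      INTEGER PRIMARY KEY,
    subject TEXT,
    raw     BLOB NOT NULL
);
CREATE TABLE recipient (
    message_id INTEGER NOT NULL REFERENCES message(id),
    field      TEXT NOT NULL,
    name       TEXT NOT NULL,
    address    TEXT NOT NULL
);
"""


def _decode(value):
    """Decode any RFC 2047 encoded words in a header value to text."""
    return str(make_header(decode_header(value)))


def store_message(conn, raw):
    """Store parsed fields as text and the original message as bytes."""
    msg = message_from_bytes(raw)
    cur = conn.execute(
        "INSERT INTO message (subject, raw) VALUES (?, ?)",
        (_decode(msg.get("Subject", "")), raw),
    )
    message_id = cur.lastrowid
    # One message maps to many recipients, hence the related table.
    for field in ("To", "Cc"):
        for name, address in getaddresses(msg.get_all(field, [])):
            conn.execute(
                "INSERT INTO recipient (message_id, field, name, address) "
                "VALUES (?, ?, ?, ?)",
                (message_id, field, _decode(name), address),
            )
    return message_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Header fields the schema does not name are not lost in this arrangement, since they remain in the raw column and can be reparsed when needed.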
Thinking about this stuff makes me nostalgic for the sloppy happy days of Python 2.x
You now have the opportunity to finally unsnarl that mess. It is not an