[Python-Dev] [Email-SIG] Dropping bytes "support" in json

Fri Apr 10 05:59:54 CEST 2009

At 22:26 -0400 04/09/2009, Barry Warsaw wrote:

>There are really two ways to look at an email message.  It's either an
>unstructured blob of bytes, or it's a structured tree of objects.
>Those objects have headers and payload.  The payload can be of any
>type, though I think it generally breaks down into "strings" for text/
>* types and bytes for anything else (not counting multiparts).
>
>The email package isn't a perfect mapping to this, which is something
>I want to improve.  That aside, I think storing a message in a
>database means storing some or all of the headers separately from the
>byte stream (or text?) of its payload.  That's for non-multipart
>types.  It would be more complicated to represent a message tree of
>course.

Storing an email message in a database does mean storing some of the header
fields as database fields, but the set of email header fields is open, so
any "unused" fields in a message must be stored elsewhere.  It isn't useful
to just have a bag of name/value pairs in a table.  General message MIME
payload trees don't map well to a database either, unless one wants to get
very relational.  Sometimes the database needs to represent the entire
email message, header fields and MIME tree, but only if it is an email
program and usually not even then.  Usually, the database has a specific
purpose, and can be designed for the data it cares about; it may choose to
keep the original message as bytes.

>It does seem to make sense to think about headers as text header names
>and text header values.  Of course, header values can contain almost
>anything and there's an encoding to bring it back to 7-bit ASCII, but
>again, you really have two views of a header value.  Which you want
>really depends on your application.

I think of header fields as having text-like names (the set of allowed
characters is more than just text, though defined headers don't make use of
that), but the data is either bytes or it should be parsed into something
appropriate:  text for unstructured fields like Subject:, a list of
addresses for address fields like To:.  Many of the structured header
fields have a reasonable mapping to text; certainly this is true for adress
header fields.  Content-Type header fields are barely text, they can be so
convolutedly structured, but I suppose one could flatten one of them to
text instead of bytes if the user wanted.  It's not very useful, though,
except for debugging (either by the programmer or the recipient who wants
to know what was cleaned from the message).

>Maybe you just care about the text of both the header name and value.
>In that case, I think you want the values as unicodes, and probably
>the headers as unicodes containing only ASCII.  So your table would be
>strings in both cases.  OTOH, maybe your application cares about the
>raw underlying encoded data, in which case the header names are
>probably still strings of ASCII-ish unicodes and the values are
>bytes.  It's this distinction (and I think the competing use cases)
>that make a true Python 3.x API for email more complicated.

If a database stores the Subject: header field, it would be as text.  The
various recipient address fields are a one message to many names and
addresses mapping, and need a related table of name/address fields, with
each field being text.  The original message (or whatever part of it one
preserves) should be bytes.  I don't think this complicates the email
package API; rather, it just shows where generality is needed.

>Thinking about this stuff makes me nostalgic for the sloppy happy days
>of Python 2.x

You now have the opportunity to finally unsnarl that mess.  It is not an
insurmountable opportunity.
-- 
____________________________________________________________________
TonyN.:'                       <mailto:tonynelson at georgeanelson.com>
      '                              <http://www.georgeanelson.com/>