[Python-3000] should rfc822 accept text io or binary io?

Bill Janssen janssen at parc.com
Sat Aug 18 01:28:14 CEST 2007


> On 8/17/07, Bill Janssen <janssen at parc.com> wrote:
> > > Ideally, the package would be well suited not only for wire-to-wire
> > > and all-internal uses, but also related domains like HTTP and other
> > > RFC 2822-like contexts.
> >
> > But that's exactly why the internal representation should be bytes,
> > not strings.  HTTP's use of MIME, for instance, uses "binary" quite a
> > lot.
> 
> In the specific case of HTTP, it certainly looks like the headers are
> represented on the wire as 7-bit ASCII and could be treated as bytes
> or strings by the header processing code it uses via rfc822.py.  The
> actual body of the response should still be represented as bytes,
> which can be converted to strings by the application.

Note that, in the case of HTTP, both the request message and the
response message may contain MIME-tagged binary data.  And some of the
header values for those message types may contain arbitrary ISO-8859-1
octets, not necessarily encoded.  See sections 4.2 and 2.2 of RFC
2616.
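
To make the ISO-8859-1 point concrete, here is a minimal sketch.  The
header name and value are made up for illustration; the only claim is
that an octet above 0x7F in a header value breaks an ASCII-only
decoder, while the bytes themselves are perfectly well defined.

```python
# Hypothetical HTTP header line carrying a raw ISO-8859-1 octet (0xE9).
raw_header = b"X-Comment: caf\xe9\r\n"

# Code that assumes 7-bit ASCII headers fails on this input.
try:
    raw_header.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Split on the raw bytes first, then decode the value with the
# charset that RFC 2616 section 2.2 names for TEXT.
name, _, value = raw_header.rstrip(b"\r\n").partition(b": ")
decoded_value = value.decode("iso-8859-1")
```

Working on bytes and decoding late keeps the choice of charset in the
hands of the code that knows which protocol it is speaking.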

But we're not really interested in those message headers -- that's a
consideration for the HTTP libraries.  I'm just concerned about the
MIME standard, which both HTTP and email use, though in different
ways.  The MIME processing in the "email" module must follow the MIME
spec, RFC 2045, 2046, etc., rather than assume RFC 2821 (SMTP) and RFC
2822 encoding everywhere.  SMTP is only one form of message envelope.

The important thing is that we understand that raw mail messages --
say in MH format in a file -- do not consist of "lines" of "text";
they are complicated binary data structures, often largely composed of
pieces of text encoded in very specific ways.  As such, the raw
message *must* be treated as a sequence of bytes.  And the content of
any body part may also be an arbitrary sequence of bytes (which, in an
RFC 2822 context, must be encoded into ASCII octets).  The value of
any header may be an arbitrary string in an arbitrary language in an
arbitrary character set (see RFCs 2047 and 2231), though it must be
put into the message encoded as a sequence of octets drawn from a
subset of the ASCII octets.
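
As a rough illustration of the bytes-in, decode-on-demand model, here
is a sketch using the "email" package API as it exists in later
Python 3 releases (email.message_from_bytes was added in 3.2, so it is
not something available at the time of this thread).  The sample
message and addresses are invented for the example.

```python
from email import message_from_bytes
from email.header import decode_header, make_header

# A raw message is a sequence of octets.  Every octet here is ASCII,
# yet the subject is non-ASCII text (RFC 2047 encoded-word) and the
# body is arbitrary binary data (base64 transfer encoding).
RAW = (
    b"From: sender@example.org\r\n"
    b"Subject: =?utf-8?b?w6l0w6k=?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"AAH//g==\r\n"
)

# Parse from bytes, not from text.
msg = message_from_bytes(RAW)

# Header value: undo the RFC 2047 encoding to recover the string.
subject = str(make_header(decode_header(msg["Subject"])))

# Body: undo the transfer encoding to recover the original octets.
payload = msg.get_payload(decode=True)
```

The decoded subject is a string and the decoded payload is bytes; the
raw message in between is neither "lines" nor "text", just octets that
happen to fall in the ASCII range.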

Maybe all of this argues for separating "mime" and "email" into two
different packages.  And maybe renaming "email" "internet-email" or
"rfc2822-email".

Bill



