[Python-3000] should rfc822 accept text io or binary io?

Tue Aug 7 19:38:44 CEST 2007

On 8/7/07, Guido van Rossum <guido at python.org> wrote:
> On 8/7/07, Jeremy Hylton <jeremy at alum.mit.edu> wrote:
> > On 8/6/07, Fred Drake <fdrake at acm.org> wrote:
> > > On Aug 6, 2007, at 4:46 PM, skip at pobox.com wrote:
> > > > I thought rfc822 was going away.  From the current module
> > > > documentation:
> > > > ...
> > > > Shouldn't rfc822 be gone altogether in Python 3?
> > >
> > > Yes.  And the answers to Jeremy's questions about what sort of IO is
> > > appropriate for the email package should be left to the email-sig as
> > > well, I suspect.  It's good that they've come up.
> >
> > Hmmm.  Should we be using the email package to parse HTTP headers?
> > RFC 2616 says that HTTP headers follow the "same generic format" as
> > RFC 822, but RFC 822 says headers are ASCII and RFC 2616 says headers
> > are arbitrary 8-bit values.  You'd need to parse them differently.
>
> I'm confused (and too lazy to read the RFCs). How can you have case
> insensitivity (as HTTP clearly has) if the headers are arbitrary 8-bit
> values? Assuming they mean it's an ASCII superset, does that mean that
> HTTP doesn't have case insensitivity for bytes with values > 127?

For HTTP, the header names need to be ASCII, but the values can contain
bytes > 127.  I haven't read enough of the spec to know which header
values might include binary data and how you are supposed to interpret
them.  Assuming that the spec allows OCTET instead of token (which is
ASCII) for a reason, it suggests that the header values need to be
bytes.
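
A rough sketch of that split, to make it concrete.  The function name
parse_header_line is made up for illustration, and this is not how
httplib or the email package do it: the name is decoded as ASCII and
lower-cased, and the value is left as undecoded bytes.

```python
def parse_header_line(line):
    """Split one raw header line into (name, value).

    Hypothetical helper: the name is decoded as ASCII and lower-cased so
    lookups can be case-insensitive; the value stays as bytes.
    """
    name, sep, value = line.partition(b":")
    if not sep:
        raise ValueError("not a header line: %r" % (line,))
    # decode() raises UnicodeDecodeError if the name has a byte > 127.
    return name.decode("ascii").lower(), value.strip(b" \t\r\n")


# The value may carry bytes > 127; nothing here guesses an encoding.
name, value = parse_header_line(b"X-Custom: caf\xe9\r\n")
```

Whether the caller then wants bytes or a decoded string for a given
header is exactly the part the sketch leaves open.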

> > I also wonder if it makes sense for httplib to depend on email.  If it
> > is possible to write generic code, maybe it belongs in a common
> > library rather than in either email or httplib.
> >
> > I meant my original email to ask a more general question:  Does anyone
> > have some suggestions about how to design libraries that could deal
> > with bytes or strings?  If an HTTP header value contains 8-bit binary
> > data, does the client application expect bytes or a string in some
> > encoding?
> >
> > If you have a library that consumes file-like objects, how do you deal
> > with bytes vs. strings?  Do you have two constructor options so that
> > the client can specify what kind of output the file-like object
> > produces?  Do you try to guess?  Do you just write code assuming
> > strings and let it fail on a bad lower() call when it gets bytes?
>
> In general I'm against writing polymorphic code that tries to work for
> strings as well as bytes, except very small algorithms. For larger
> amounts of code, you almost always run into the need for literals or
> hashing or case conversion or other differences (e.g. \n vs. \r\n when
> doing I/O).
>
> I think it's conceptually cleaner to pick a particular type for an API
> and stick to it. E.g. sockets, binary files (io.RawIOBase) and *dbm
> files read/write bytes; text files (io.TextIOBase) read/write strings.

It certainly makes rfc822 tricky to update.  Is it intended to work
with files or sockets?  In Python 2.x, it works with either.  If we
have some future email/rfc822/httpheaders library that parses the
"generic format," will it work with sockets or files or will we have
two versions?
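
One possible shape for a single version, sketched under the assumption
that the parser only ever sees a binary file-like object with a
readline() method.  The name read_headers is hypothetical, and
continuation lines are ignored to keep it short.

```python
import io


def read_headers(fp):
    """Read header lines from a binary file-like object up to the blank line.

    Hypothetical helper: returns a list of (name, value) pairs with str
    names and bytes values.  Folded (continuation) lines are not handled.
    """
    headers = []
    while True:
        line = fp.readline()
        # A blank line ends the header block; b"" means end of input.
        if line in (b"\r\n", b"\n", b""):
            break
        name, sep, value = line.partition(b":")
        if not sep:
            raise ValueError("not a header line: %r" % (line,))
        headers.append((name.decode("ascii").lower(), value.strip()))
    return headers


# io.BytesIO stands in for a binary file or a wrapped socket.
fp = io.BytesIO(b"Host: example.org\r\nX-Data: \xff\xfe\r\n\r\nbody")
headers = read_headers(fp)
rest = fp.read()  # whatever follows the blank line is left unread
```

With this shape a file would be opened in "rb" mode and a socket would
be wrapped with sock.makefile("rb"), so the parser itself never has to
ask which one it was given.  It does not help callers that only have a
text-mode file.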

Jeremy