[Python-3000] should rfc822 accept text io or binary io?

Tue Aug 7 20:31:30 CEST 2007

On 8/7/07, Guido van Rossum <guido at python.org> wrote:
> On 8/7/07, Jeremy Hylton <jeremy at alum.mit.edu> wrote:
> > On 8/7/07, Guido van Rossum <guido at python.org> wrote:
> > > On 8/7/07, Jeremy Hylton <jeremy at alum.mit.edu> wrote:
> > > > Hmmm.  Should we be using the email package to parse HTTP headers?
> > > > RFC 2616 says that HTTP headers follow the "same generic format" as
> > > > RFC 822, but RFC 822 says headers are ASCII and RFC 2616 says headers
> > > > are arbitrary 8-bit values.  You'd need to parse them differently.
> > >
> > > I'm confused (and too lazy to read the RFCs). How can you have case
> > > insensitivity (as HTTP clearly has) if the headers are arbitrary 8-bit
> > > values? Assuming they mean it's an ASCII superset, does that mean that
> > > HTTP doesn't have case insensitivity for bytes with values > 127?
> >
> > For HTTP, the header names need to be ASCII, but the values can contain
> > bytes > 127.  I haven't read enough of the spec to know which header
> > values might include binary data and how you are supposed to interpret
> > them.  Assuming that the spec allows OCTET instead of token (which is
> > ASCII) for a reason, it suggests that the header values need to be
> > bytes.
>
> Bizarre. I'm not aware of any HTTP header that requires *binary*
> values. I can imagine though that they may contain *encoded* text and
> that they are leaving the encoding up to separate negotiations between
> client and server, or another header, or specified explicitly by the
> header, etc. It can't be pure binary because it's still subject to the
> \r\n line terminator.

I did a little more reading.

"""The TEXT rule is only used for descriptive field contents and values
   that are not intended to be interpreted by the message parser. Words
   of *TEXT MAY contain characters from character sets other than ISO-
   8859-1 [22] only when encoded according to the rules of RFC 2047
   [14].

       TEXT           = <any OCTET except CTLs,
                        but including LWS>
"""

The odd thing here is that RFC 2047 (MIME) seems to be about encoding
non-ASCII character sets in ASCII.  So the spec is kind of odd here.
The actual bytes on the wire seem to be ASCII, but they may have an
interpretation where those ASCII bytes represent a non-ASCII string.
So the shared parsing with email/rfc822 does seem reasonable.
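
A minimal sketch of that interpretation step, assuming the header line
arrives as pure-ASCII bytes and carries an RFC 2047 encoded word. The
sample header line is made up for illustration; decode_header() and
make_header() are the existing helpers in the email package.

```python
from email.header import decode_header, make_header

# Hypothetical header line as it would appear on the wire: ASCII only.
raw = b"Subject: =?iso-8859-1?q?caf=E9?="

# The wire bytes decode cleanly as ASCII.
line = raw.decode("ascii")
name, _, value = line.partition(":")

# RFC 2047 decoding turns the ASCII encoded word into the non-ASCII string.
decoded = str(make_header(decode_header(value.strip())))

# Header names are ASCII, so case-insensitive comparison is straightforward.
is_subject = name.lower() == "subject"
```

The header name stays plain ASCII text throughout; only the value needs
the extra decoding pass.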

> > > In general I'm against writing polymorphic code that tries to work for
> > > strings as well as bytes, except very small algorithms. For larger
> > > amounts of code, you almost always run into the need for literals or
> > > hashing or case conversion or other differences (e.g. \n vs. \r\n when
> > > doing I/O).
> > >
> > > I think it's conceptually cleaner to pick a particular type for an API
> > > and stick to it. E.g. sockets, binary files (io.RawIOBase) and *dbm
> > > files read/write bytes; text files (io.TextIOBase) read/write strings.
> >
> > It certainly makes rfc822 tricky to update.  Is it intended to work
> > with files or sockets?  In Python 2.x, it works with either.  If we
> > have some future email/rfc822/httpheaders library that parses the
> > "generic format," will it work with sockets or files or will we have
> > two versions?
>
> It never worked with socket objects, did it? If it worked with the
> objects returned by makefile(), why not use text mode ("r" or "w") as
> the mode arg? (Then you can even specify an encoding.) IMO it makes
> more sense to treat rfc822 headers as text, since they are for all
> intents and purposes meant to be human-readable, and there's case
> insensitivity implied.

We use the same makefile() object to read the headers and the body.
We can't trust the body is text.  I guess we could change the code to
use two different makefile() calls--a text one for headers that is
closed when the headers are done, and a binary one for the body.
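
A rough sketch of the text-headers/binary-body split, not a worked-out
design. It assumes a single binary stream (io.BytesIO stands in for the
object returned by makefile("rb")) and decodes each header line to text
as it is read, rather than opening a second text-mode file; a buffered
text-mode file could read ahead past the blank line and swallow part of
the body. The read_message() helper and the sample message are
hypothetical, and continuation lines are not handled.

```python
import io


def read_message(stream):
    """Read headers as text and the body as bytes from one binary stream."""
    headers = {}
    while True:
        line = stream.readline()
        # A blank line (or end of stream) ends the header section.
        if line in (b"\r\n", b"\n", b""):
            break
        # ISO-8859-1 maps every byte to a character, so decoding cannot fail.
        name, _, value = line.decode("iso-8859-1").partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    # Everything after the blank line is left as undecoded bytes.
    body = stream.read()
    return headers, body


# Stand-in for sock.makefile("rb"): two headers followed by a binary body.
stream = io.BytesIO(
    b"Content-Type: image/png\r\nContent-Length: 4\r\n\r\n\x89PNG"
)
headers, body = read_message(stream)
```

Here the headers come back as str and the body as bytes, so the body is
never assumed to be text.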

Jeremy