[Web-SIG] Output header encodings? (was Re: Backup plan: WSGI 1 Addenda and wsgiref update for Py3)

Thu Sep 23 22:48:51 CEST 2010

On Thu, Sep 23, 2010 at 3:23 PM, P.J. Eby <pje at telecommunity.com> wrote:

> At 11:17 AM 9/23/2010 -0500, Ian Bicking wrote:
>
>> I don't see any reason why Location shouldn't be ASCII.  Any header could
>> have any character put in it, of course, there's just no valid case where
>> Location shouldn't be a URL, and URLs are ASCII.  Cookie can contain
>> weirdness, yes.  I would expect any library that abstracts cookies to
>> handle this (it's certainly doable)... otherwise, this seems like one among
>> many ways a person can do the wrong thing.
>>
>>
>> This can also be detected with the validator, which doesn't avoid runtime
>> errors, but bytes allow runtime errors too -- they will just happen
>> somewhere else (e.g., when a value is converted to bytes in an application
>> or library).
>>
>
> Right: somewhere much closer to the *actual* error, where the developer can
> know the problem is, "I have garbage data or have not selected an
> appropriate codec", rather than "this WSGI stuff is giving me errors some
> place".
>
>
>> If servers print the invalid value on error (instead of just some generic
>> error) I don't think it would be that hard to track down problems.  This
>> requires some explicit effort on the part of the server (most servers handle
>> app_iter==None ungracefully, which is a similar problem).
>>
>
> The difference is that if a server rejects non-bytes, you'll know *right
> away* that your app isn't compliant, instead of having to wait until some
> non-latin1 data shows up.
>

No, you've only pushed the encoding elsewhere, and the error elsewhere.
Somewhere someone is probably doing text_value.encode('ascii') (or latin1 or
whatever), and if they haven't tested with non-ascii or non-latin1 input
then they might encounter an error.  It will be in their code, not in the
WSGI server, but the error will be present in all the same situations.  I
don't think it will be much harder to fix if it occurs in the WSGI server,
so long as the error message is at least a little bit helpful.
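
As a rough illustration of that point (a sketch only; both function names
are hypothetical and latin-1 is assumed as the codec), the same input fails
in either place, and the server-side version can still say which header and
which value caused it:

```python
def encode_in_app(text_value, codec="latin-1"):
    # "Output is bytes" model: the application (or a library it uses)
    # encodes the header value itself before handing it to the server.
    return text_value.encode(codec)


def encode_in_server(name, text_value, codec="latin-1"):
    # "Headers are text" model: a hypothetical server encodes the value
    # and reports the offending header name and value on failure.
    try:
        return text_value.encode(codec)
    except UnicodeEncodeError as exc:
        raise ValueError(
            "header %s has value %r that cannot be encoded as %s"
            % (name, text_value, codec)
        ) from exc
```

Either way the failure only shows up once a value outside the codec arrives;
the only difference is whose traceback it lands in.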

> AFAICT, there are only two advantages to using text for output headers:
>
> 1. Text is easier to work with, and
> 2. It's symmetric with using text for input headers.
>
> Both of which can still be had, by using the @encode_headers decorator.
>

Sure, anything can be fixed in a library.  But @encode_headers is just
another library.  And it also can't magically appear with 2to3, instead it
requires yet more patches and weird workarounds.
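
For reference, such a decorator might look roughly like the following.  This
is a guess at its shape, not the code that was proposed: the latin-1 default
and the assumption that the status line is encoded along with the headers
are mine.

```python
import functools


def encode_headers(app, codec="latin-1"):
    """Wrap a WSGI app that emits text headers for a bytes-only server."""

    @functools.wraps(app)
    def wrapper(environ, start_response):
        def text_start_response(status, headers, exc_info=None):
            # Encode the status line and every (name, value) pair before
            # passing them on to the server's real start_response.
            byte_headers = [
                (name.encode(codec), value.encode(codec))
                for name, value in headers
            ]
            return start_response(status.encode(codec), byte_headers, exc_info)

        return app(environ, text_start_response)

    return wrapper
```

Every application that wants text headers would have to import and apply
this wrapper by hand.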

Also, what you are proposing hasn't been considered for PEP 444, though
other combinations of bytes and text have (all symmetric).  So it doesn't
seem to have any clean way to translate into the next version of the
specification.

> I'm a little bit on the fence on this one, because 1) it does seem a little
> pointless (if harmless) to shuffle headers around in bytes form, and 2)
> Location and Set-Cookie are very likely the only headers where any kind of
> damage could ever happen.
>

Set-Cookie only; Location is clean.  The entirety of hand-wringing over
bytes is all just about freakin' cookies.  Or the theory of cookies; I don't
know that anyone has yet encountered any concrete and vexing problems.

> But, since it *can* happen, and because it is also really easy to fix the
> API issue with a decorator, I'm still leaning in favor of "output is bytes"
> over "headers are text, bodies are bytes", unless somebody can come up with
> either some actually-bad consequence of using bytes, or some extra-good
> consequence of using text (that isn't addressed by just using the
> decorator).
>
> (Note, by the way, that WSGI design has always leaned in the direction of
> "any convenience that can be handled by a library should be", if it keeps
> the spec simpler and more verifiable.  So, this seems like a good use of
> that principle.)
>

It only fixes the one case of non-Latin1 characters; there are still many
other bad values you can put into a header (a newline or control character,
for instance), and innumerable header-specific issues.  It seems to be adding
complexity for one of the least problematic cases.
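
A small sketch of what I mean (the `check_header_value` helper is
hypothetical, and the control-character range is just one plausible choice):
the encoding check is only one of the checks a validator would make, and
bytes do nothing for the others.

```python
import re

# ASCII control characters, including CR and LF.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f]")


def check_header_value(name, value):
    """Return a list of problems found in a text header value."""
    problems = []
    try:
        value.encode("latin-1")
    except UnicodeEncodeError:
        problems.append("non-Latin-1 character")
    if _CONTROL_CHARS.search(value):
        # Encodes without error, but is still not a usable header value.
        problems.append("control character or newline")
    return problems
```

A value like "a\r\nSet-Cookie: x" encodes to bytes without complaint and is
only caught by the second check.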

--
Ian Bicking  |  http://blog.ianbicking.org