On Thu, Sep 23, 2010 at 3:23 PM, P.J. Eby <span dir="ltr">&lt;<a href="mailto:pje@telecommunity.com">pje@telecommunity.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


At 11:17 AM 9/23/2010 -0500, Ian Bicking wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I don&#39;t see any reason why Location shouldn&#39;t be ASCII.  Any header could have any character put in it, of course, there&#39;s just no valid case where Location shouldn&#39;t be a URL, and URLs are ASCII.  Cookie can contain weirdness, yes.  I would expect any library that abstracts cookies to handle this (it&#39;s certainly doable)... otherwise, this seems like one among many ways a person can do the wrong thing.<div class="im">


<br>

<br>

This can also be detected with the validator, which doesn&#39;t avoid runtime errors, but bytes allow runtime errors too -- they will just happen somewhere else (e.g., when a value is converted to bytes in an application or library).<br>


</div></blockquote>

<br>

Right: somewhere much closer to the *actual* error, where the developer can know the problem is, &quot;I have garbage data or have not selected an appropriate codec&quot;, rather than &quot;this WSGI stuff is giving me errors some place&quot;.<br>


<br>

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

If servers print the invalid value on error (instead of just some generic error) I don&#39;t think it would be that hard to track down problems.  This requires some explicit effort on the part of the server (most servers handle app_iter==None ungracefully, which is a similar problem).<br>


</blockquote>

<br>

The difference is that if a server rejects non-bytes, you&#39;ll know *right away* that your app isn&#39;t compliant, instead of having to wait until some non-latin1 data shows up.<br></blockquote><div><br>No, you&#39;ve only pushed the encoding elsewhere, and the error elsewhere.  Somewhere someone is probably doing text_value.encode(&#39;ascii&#39;) (or latin1 or whatever), and if they haven&#39;t tested with non-ascii or non-latin1 input then they might encounter an error.  It will be in their code, not in the WSGI server, but the error will be present in all the same situations.  I don&#39;t think it will be much harder to fix if it occurs in the WSGI server, so long as the error message is at least a little bit helpful.<br>
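<br>To illustrate what "a little bit helpful" could look like, here is a minimal sketch, assuming a server encodes text header values itself. The function name encode_header_value and the latin-1 default are assumptions for illustration, not part of any WSGI server or specification.<br>

```python
def encode_header_value(value, codec="latin-1"):
    """Encode a text header value, naming the offending value on failure.

    Hypothetical helper: it fails on exactly the same inputs as a bare
    value.encode(codec) in application code, only with a clearer message.
    """
    try:
        return value.encode(codec)
    except UnicodeEncodeError as exc:
        raise ValueError(
            f"header value {value!r} cannot be encoded as {codec}: {exc.reason}"
        ) from exc
```

The failure occurs for the same data whether this call lives in the application or in the server; only the traceback location and the message differ.<br>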


 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

AFAICT, there are only two advantages to using text for output headers:<br>

<br>

1. Text is easier to work with, and<br>

2. It&#39;s symmetric with using text for input headers.<br>

<br>

Both of which can still be had, by using the @encode_headers decorator.<br></blockquote><div><br>Sure, anything can be fixed in a library.  But @encode_headers is just another library.  And it also can&#39;t magically appear with 2to3, instead it requires yet more patches and weird workarounds.<br>
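<br>For reference, a rough sketch of what such a decorator might look like under the "output is bytes" proposal. Only the name encode_headers comes from this thread; the signature, the latin-1 default, and the pass-through of values that are already bytes are assumptions.<br>

```python
import functools


def encode_headers(app, codec="latin-1"):
    """Hypothetical decorator: the app passes text status and headers,
    and the wrapped start_response receives bytes."""

    def to_bytes(value):
        # Values that are already bytes are passed through unchanged.
        return value.encode(codec) if isinstance(value, str) else value

    @functools.wraps(app)
    def wrapper(environ, start_response):
        def text_start_response(status, headers, exc_info=None):
            encoded = [(to_bytes(n), to_bytes(v)) for n, v in headers]
            if exc_info is None:
                return start_response(to_bytes(status), encoded)
            return start_response(to_bytes(status), encoded, exc_info)

        return app(environ, text_start_response)

    return wrapper


@encode_headers
def hello_app(environ, start_response):
    # The application keeps working with text for status and headers.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]
```

Every application would need to apply something like this explicitly; nothing in 2to3 would add it.<br>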


<br>Also, what you are proposing hasn&#39;t been considered for PEP 444, though other combinations of bytes and text have (all symmetric).  So it doesn&#39;t seem to have any clean way to translate into the next version of the specification.<br>


 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

I&#39;m a little bit on the fence on this one, because 1) it does seem a little pointless (if harmless) to shuffle headers around in bytes form, and 2) Location and Set-Cookie are very likely the only headers where any kind of damage could ever happen.<br>


</blockquote><div><br>Set-Cookie only, Location is clean.  The entirety of hand-wringing over bytes is all just about freakin&#39; cookies.  Or the theory of cookies, I don&#39;t know that anyone has yet encountered any concrete and vexing problems.<br>


<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

But, since it *can* happen, and because it is also really easy to fix the API issue with a decorator, I&#39;m still leaning in favor of &quot;output is bytes&quot; over &quot;headers are text, bodies are bytes&quot;, unless somebody can come up with either some actually-bad consequence of using bytes, or some extra-good consequence of using text (that isn&#39;t addressed by just using the decorator).<br>


<br>

(Note, by the way, that WSGI design has always leaned in the direction of &quot;any convenience that can be handled by a library should be&quot;, if it keeps the spec simpler and more verifiable.  So, this seems like a good use of that principle.)<br>


</blockquote></div><br>It only fixes the one case of non-Latin1 characters; there are still many other invalid values you can put into a header (a newline or control character, for instance), and innumerable header-specific issues.  It seems to be adding complexity for one of the least problematic cases.<br clear="all">
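<br>As a small illustration of the control-character case, here is a sketch of a check that is independent of the bytes/text question. The name check_header_value and the exact character set rejected are assumptions; it allows tab and rejects other ASCII control characters.<br>

```python
import re

# ASCII control characters other than horizontal tab (assumed policy).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0a-\x1f\x7f]")


def check_header_value(name, value):
    """Raise ValueError if a text header value contains a control character."""
    match = _CONTROL_CHARS.search(value)
    if match:
        raise ValueError(
            f"header {name!r} contains control character "
            f"{match.group()!r} at index {match.start()}"
        )
    return value
```

A value such as "/next\r\nX-Injected: 1" encodes to Latin-1 without error, so requiring bytes does nothing to catch it, while a check like this does.<br>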


<br>--<br>Ian Bicking  |  <a href="http://blog.ianbicking.org">http://blog.ianbicking.org</a><br>