On Thu, Sep 23, 2010 at 11:06 AM, P.J. Eby <span dir="ltr">&lt;<a href="mailto:pje@telecommunity.com">pje@telecommunity.com</a>&gt;</span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


At 12:57 PM 9/21/2010 -0400, Ian Bicking wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On Tue, Sep 21, 2010 at 12:09 PM, P.J. Eby &lt;<a href="mailto:pje@telecommunity.com" target="_blank">pje@telecommunity.com</a>&gt; wrote:<br>


The Python 3 specific changes are to use:<br>

<br>

* ``bytes`` for I/O streams in both directions<br>

* ``str`` for environ keys and values<br>

* ``bytes`` for arguments to start_response() and write()<br>

<br>

<br>

This is the only thing that seems odd to me -- it seems like the response should be symmetric with the request, and the request in this case uses str for headers (status being header-like), and bytes for the body.<br>

</blockquote>

<br>

So, I&#39;ve given some thought to your suggestion, and, while it&#39;s true that most of the output headers are far less prone to ending up with unintended unicode content, there are at least two output headers that can include some sort of application content (and can therefore have random failures): Location and Set-Cookie.<br>


<br>

If these headers accidentally contain non-Latin1 characters, the error isn&#39;t detectable until the header reaches the origin server doing the transmission encoding, and it&#39;ll likely be a dynamic (and therefore hard-to-debug) error.<br>


</blockquote><div><br>I don&#39;t see any reason why Location shouldn&#39;t be ASCII.  Any header could have any character put in it, of course, but there&#39;s no valid case where Location isn&#39;t a URL, and URLs are ASCII.  Set-Cookie can contain weirdness, yes.  I would expect any library that abstracts cookies to handle this (it&#39;s certainly doable); otherwise, this seems like one among many ways a person can do the wrong thing.<br>
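<br>As a rough sketch of the Location case: an app can percent-encode anything non-ASCII before the value goes into the header, so it is plain ASCII by the time it reaches start_response().  The helper name ascii_location is hypothetical, and encoding as UTF-8 via urllib.parse.quote is an assumption about how the app wants its URLs written; non-ASCII host names are not handled here.<br>

```python
from urllib.parse import quote, urlsplit, urlunsplit


def ascii_location(url):
    """Percent-encode non-ASCII characters in a URL's path, query and fragment.

    Hypothetical helper for illustration only; host names (IDNA) are left alone.
    """
    parts = urlsplit(url)
    # "%" stays in the safe set so already-encoded sequences are not re-encoded.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


location = ascii_location("http://example.com/caf\u00e9?q=na\u00efve")
# Raises UnicodeEncodeError if anything non-ASCII were left in the value.
location.encode("ascii")
```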


<br>This can also be detected with the validator, which doesn&#39;t avoid runtime errors, but bytes allow runtime errors too -- they will just happen somewhere else (e.g., when a value is converted to bytes in an application or library).<br>


<br>If servers print the invalid value on error (instead of just some generic error) I don&#39;t think it would be that hard to track down problems.  This requires some explicit effort on the part of the server (most servers handle app_iter==None ungracefully, which is a similar problem).<br>
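<br>Something like the following on the server side is what I have in mind.  It is only a minimal sketch: encode_header and HeaderEncodingError are made-up names, not part of wsgiref or any existing server, and latin1 is assumed as the header codec.<br>

```python
class HeaderEncodingError(ValueError):
    """Hypothetical error type that names the offending header."""


def encode_header(name, value, codec="latin1"):
    """Encode one (name, value) header pair, reporting the bad value on failure."""
    try:
        return name.encode(codec), value.encode(codec)
    except UnicodeEncodeError as exc:
        # Put the header name and the repr() of the value in the message so the
        # app author sees exactly what was passed, not just a generic error.
        raise HeaderEncodingError(
            "header %r has a value that cannot be encoded as %s: %r"
            % (name, codec, value)
        ) from exc


try:
    encode_header("Location", "http://example.com/\u2603")
except HeaderEncodingError as err:
    message = str(err)
```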


<br><br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

However, if the output is always bytes (and this can be relatively-statically verified), then any error can&#39;t occur except *inside* the application, where the app&#39;s developer can find it more easily.<br>

<br>

So I guess the question boils down to: would we rather make sure that coding errors happen *inside* applications, or would we rather make porting WSGI apps trivial (or nearly so)?<br>

<br>

But I think that it&#39;s possible here to have one&#39;s cake and eat it too: if we require bytes for all outputs, but provide a pair of decorators in wsgiref.util like the following:<br>

<br>

    def encode_body(codec=&#39;utf8&#39;):<br>

        &quot;&quot;&quot;Allow a WSGI app to output its response body as strings w/specified encoding&quot;&quot;&quot;<br>

        def decorate(app):<br>

            def encode(response):<br>

                try:<br>

                    for data in response:<br>

                        yield data.encode(codec)<br>

                finally:<br>

                    if hasattr(response, &#39;close&#39;):<br>

                        response.close()<br>

            def decorated_app(environ, start_response):<br>

                def start(status, response_headers, exc_info=None):<br>

                    _write = start_response(status, response_headers, exc_info)<br>

                    def write(data):<br>

                        return _write(data.encode(codec))<br>

                    return write<br>

                return encode(app(environ, start))<br>

            return decorated_app<br>

        return decorate<br>

<br>

    def encode_headers(codec=&#39;latin1&#39;):<br>

        &quot;&quot;&quot;Allow a WSGI app to output its headers as strings, w/specified encoding&quot;&quot;&quot;<br>

        def decorate(app):<br>

            def decorated_app(environ, start_response):<br>

                def start(status, response_headers, exc_info=None):<br>

                    status = status.encode(codec)<br>

                    response_headers = [<br>

                        (k.encode(codec), v.encode(codec)) for k,v in response_headers<br>

                    ]<br>

                    return start_response(status, response_headers, exc_info)<br>

                return app(environ, start)<br>

            return decorated_app<br>

        return decorate<br>

<br>

So, this seems like a win-win to me: relatively-static verification, errors stay in the app (or at least in the decorator), and the API is clean-and-easy.  Indeed, it seems likely that at least some apps that don&#39;t read wsgi.input themselves could be ported *just* by adding the appropriate decorator(s).  And, if your app is using unicode on 2.x, you can even use the same decorators there, for the benefit of 2to3.  (Assuming I release an updated standalone wsgiref version with the decorators, of course.)<br>


</blockquote><div><br>This doesn&#39;t seem that different from the validator, except that the decorator uses a different interface internally and externally (the internal interface using text, the external one bytes).<br>


</div></div><br clear="all"><br>-- <br>Ian Bicking  |  <a href="http://blog.ianbicking.org">http://blog.ianbicking.org</a><br>