[Web-SIG] WSGI 2

Robert Brewer fumanchu at aminus.org
Wed Aug 12 06:19:49 CEST 2009


Graham Dumpleton wrote:
> So, for WSGI 1.0 style of interface and Python 3.0, the following is
> what I was going to implement.

FWIW, I'll answer with what we've implemented for CherryPy 3.2.

> 1. When running under Python 3, applications SHOULD produce bytes
> output, status line and headers.

Yup.

> 2. When running under Python 3, servers and gateways MUST accept
> strings for output, status line and headers. Such strings must be
> converted to bytes output using 'latin-1'. If a string cannot be
> converted then it is treated as an error.

Yes.
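
A minimal sketch of that conversion rule, for illustration only. This is not CherryPy's actual code, and the helper name to_wsgi_bytes is hypothetical:

```python
def to_wsgi_bytes(value):
    """Coerce a status line, header field, or body chunk to bytes.

    bytes pass through unchanged; str is encoded as latin-1, and any
    character outside latin-1 is reported as an error.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        try:
            return value.encode("latin-1")
        except UnicodeEncodeError as exc:
            raise ValueError(
                "cannot encode %r as latin-1" % (value,)
            ) from exc
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)
```

Here the encoding failure surfaces as a ValueError; how a real server reports it (500 response, log entry, and so on) is left open.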

> 3. When running under Python 3, servers MUST provide wsgi.input as a
> binary (byte) input stream.

Boy howdy.

> 4. When running under Python 3, servers MUST provide a text stream for
> wsgi.errors. In converting this to a byte stream for writing to a
> file, the default encoding would be applied.

I'll look into it.

> 5. When running under Python 3, servers MUST provide CGI HTTP and
> server variables as strings. Where such values are sourced from a byte
> string, be that a Python byte string or C string, they should be
> converted as 'UTF-8'. If a specific web server infrastructure is able
> to support different encodings, then the WSGI adapter MAY provide a
> way for a user of the WSGI adapter to customise on a global basis, or
> on a per value basis what encoding is used, but this is entirely
> optional. Note that there is no requirement to deal with RFC 2047.

We're passing unicode for almost everything.

REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line and must be ASCII-decodable. The same goes for the SERVER_PROTOCOL and our custom ACTUAL_SERVER_PROTOCOL entries.

The original bytes of the Request-URI are stored in REQUEST_URI. However, PATH_INFO and QUERY_STRING are parsed from it and decoded via a configurable charset, defaulting to UTF-8. If the path cannot be decoded with that charset, ISO-8859-1 is tried. Whichever charset succeeds is stored in environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if needed. Our origin server always sets SCRIPT_NAME to '', but if we populated it, we would decode it with the same charset.
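
Roughly, the scheme just described could look like the sketch below. It is an illustration under assumptions, not the CherryPy source: the function name decode_request_uri is made up, and percent-unquoting the path before decoding is my assumption about the order of operations.

```python
from urllib.parse import unquote_to_bytes


def decode_request_uri(request_uri, charset="utf-8"):
    """Build the URI-related environ entries from raw Request-URI bytes.

    Tries the configured charset first and falls back to ISO-8859-1,
    which maps every byte to a code point and so cannot fail.
    """
    path, _, query = request_uri.partition(b"?")
    raw_path = unquote_to_bytes(path)  # assumption: unquote, then decode
    try:
        path_info = raw_path.decode(charset)
        query_string = query.decode(charset)
        used = charset
    except UnicodeDecodeError:
        path_info = raw_path.decode("iso-8859-1")
        query_string = query.decode("iso-8859-1")
        used = "iso-8859-1"
    return {
        "REQUEST_URI": request_uri,  # original bytes, untouched
        "PATH_INFO": path_info,
        "QUERY_STRING": query_string,
        "REQUEST_URI_ENCODING": used,
    }
```

An app that knows better can then recover the bytes with environ['PATH_INFO'].encode(environ['REQUEST_URI_ENCODING']) and decode them again with its own charset.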

All request headers are decoded via ISO-8859-1, which can't fail. Applications are expected to transcode these values if they believe them to be in another encoding.
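
To make the "can't fail" and "transcode" points concrete, here is a small sketch. The helper name transcode_header is hypothetical; the round trip itself is plain Python:

```python
def transcode_header(value, encoding="utf-8"):
    """Re-decode a header value that the server decoded as ISO-8859-1.

    Encoding back to ISO-8859-1 recovers the original bytes exactly,
    so the app can decode them again with the charset it believes in.
    """
    return value.encode("iso-8859-1").decode(encoding)


# ISO-8859-1 maps each of the 256 byte values to one code point,
# so decoding never raises and the round trip is lossless.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("iso-8859-1").encode("iso-8859-1")

# What the app sees if a client sent UTF-8 bytes in a header.
as_seen_by_app = "caf\u00e9".encode("utf-8").decode("iso-8859-1")
fixed = transcode_header(as_seen_by_app, "utf-8")
```

If the app guesses the wrong charset, the second decode can raise UnicodeDecodeError, so callers may want to catch that.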

> This is where I am going to diverge from what has been discussed before.
> 
> The reason I am going to pass as UTF-8 and not latin-1 is that it
> looks like Apache effectively only supports use of UTF-8. Since this
> means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
> even CGI likely cannot handle anything besides UTF-8 then I really
> can't see the point of trying to cater for a theoretical possibility
> that some HTTP client could use something besides UTF-8. In other
> words, the predominant case will be UTF-8, so let us target that.

UTF-8 is predominant for the Request-URI, and we are defaulting to UTF-8 for that, as I mentioned above. I believe I demonstrated in http://mail.python.org/pipermail/web-sig/2009-April/003755.html that UTF-8 cannot be the predominant encoding for request headers, which are instead mostly ASCII with a few ISO-8859-1 values; that is why we are defaulting to ISO-8859-1 for them.

> So, rather than burden every WSGI application with the need to convert
> from latin-1 back to bytes and then to UTF-8, let the server deal with
> it, with server using sensible default, and where server
> infrastructure can handle a different encoding, then it can provide
> option to use that encoding and WSGI application doesn't need to
> change.

If there are indeed more headers which are ISO-8859-1, then that same argument cuts both ways.

I have no problem doing the same thing here as we do for PATH_INFO: a configurable charset, or better yet a list of charsets to try in order, with a sensible default. Even UTF-8 would be fine as that default. Regardless of the default, if it is configurable, then the successful encoding should be put in a canonical environ entry so apps can transcode the value if the server got it wrong.
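
The "list of charsets to try in order" idea might be sketched as follows. The function name and the default list are assumptions for illustration, and I have deliberately not named the environ entry, since no canonical key has been agreed on:

```python
def decode_with_fallbacks(raw, charsets=("utf-8", "iso-8859-1")):
    """Decode raw bytes with the first charset in the list that works.

    Returns (text, charset) so the server can record which charset
    succeeded alongside the decoded value.
    """
    for charset in charsets:
        try:
            return raw.decode(charset), charset
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %r" % (charsets, raw))
```

With ISO-8859-1 last in the list the ValueError is unreachable; it only matters when a deployment configures a list without a catch-all charset.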

Re: bytes. We really do not want the server to set any of the above environ entries (except REQUEST_URI) to bytes. I'm surprised that those of you who maintain substantial amounts of WSGI middleware aren't fighting this; it would mean decoding the same environ entries again every time you switched middleware providers. Some of you said as much at PyCon: http://mail.python.org/pipermail/web-sig/2009-March/003701.html


Robert Brewer
fumanchu at aminus.org