[Web-SIG] String Types in WSGI [Graham's WSGI for py3]
armin.ronacher at active-4.com
Thu Sep 17 18:26:52 CEST 2009
Graham currently proposes the following behaviors for strings in WSGI
(Python version independent). However, this mail only covers the Python
3 part, which I assume becomes a separate section in the PEP or even WSGI
byte string == contains bytes
unicode string == contains unicode code points*
native string == what the python version uses as a string
(bytes in python 2, unicode in python 3)
* ucs2 / ucs4 is ignored here. You might still have problems
with surrogate pairs in ucs2 python builds and jython.
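To make the three terms concrete, here is a small sketch for Python 3. The helper names (native_string_type, to_native, utf16_units) are made up for illustration and are not part of any WSGI spec. The last helper shows why the surrogate pair footnote matters: a character outside the BMP takes two 16-bit units in UTF-16.

```python
def native_string_type():
    """Return the type the running interpreter uses for a plain '' literal."""
    return type("")


def to_native(value, encoding="latin-1"):
    """Coerce bytes to the native string type (assumes Python 3, where that is str)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def utf16_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


scheme = to_native(b"http")      # native string on Python 3
units = utf16_units("\U0001F600")  # one code point, stored as a surrogate pair
```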
> 2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
> environment, the value of the variable should be a native string.
URLs in general are a tricky topic. For this particular field it does
not matter if we decide on bytes or unicode because it will always only
contain ASCII characters. This should be picked consistently with the
type of PATH_INFO and SCRIPT_NAME.
> 3. For the CGI variables contained in the WSGI environment, the values
> of the variables are byte strings.
\o/ Totally agree with that.
> 4. The WSGI input stream 'wsgi.input' contained in the WSGI
> environment and from which request content is read, should yield byte
> strings.
> 5. The status line specified by the WSGI application must be a byte
> string.
> 6. The list of response headers specified by the WSGI application must
> contain tuples consisting of two values, where each value is a byte
> string.
Makes sense, because people put a lot of non-latin1 data in there.
However, I'm fine with latin1 for headers here as well, but that would
probably only affect cookie and custom headers.
> 7. The iterable returned by the application and from which response
> content is derived, must yield byte strings.
I totally agree.
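As a rough sketch of what points 5 to 7 would look like in application code: status, header names, header values and body chunks are all bytes. Note that this follows the bytes-everywhere proposal discussed here, not necessarily what a given server or wsgiref accepts, so the example calls the application directly with a stand-in start_response (fake_start_response and captured are names invented for the example).

```python
def app(environ, start_response):
    # Encoding is explicit: the application decides how text becomes bytes.
    body = "héllo".encode("utf-8")
    headers = [
        (b"Content-Type", b"text/plain; charset=utf-8"),
        (b"Content-Length", str(len(body)).encode("ascii")),
    ]
    start_response(b"200 OK", headers)
    return [body]


captured = {}


def fake_start_response(status, headers):
    # Stand-in for the server side; it only records what it was given.
    captured["status"] = status
    captured["headers"] = headers


chunks = app({}, fake_start_response)
```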
However, Graham moves further away from that in the rest of the blog
post, because he wants to point out that people use WSGI directly and
that explicit bytestrings in Python 3 confuse people. The latest
iteration in the blog post uses bytestrings nowhere except for headers
and the input stream.
I thought a lot about this in the past and I welcome the step to make
WSGI harder to use! This might sound absurd, but once encodings are
really explicit, people will think about it. I think we should
discourage *applications* written in WSGI and link to implementations in
The big problems are always PATH_INFO and SCRIPT_NAME. Those are the
only values that are in the dict URL-decoded and might contain non-ASCII
characters. (Except for headers, but that's a different story, because
the only real-world problem there is cookie headers, and those are
troubling for more reasons than just character sets.)
My latest change to the WSGI sandbox hg repo was that I added a
notice that later PEP revisions might document a RAW_SCRIPT_NAME or
something similar that contains the URL-quoted values. It turns out,
however, that this value is not available from within a webserver
context (we're talking about Apache and IIS here), so the problem of
unquoted values will not go away.
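A short illustration of why the unquoted values are a problem, using urllib.parse.unquote_to_bytes from the Python 3 standard library. The two request paths are made-up examples: once the server has unquoted them, an escaped slash can no longer be told apart from a real path separator.

```python
from urllib.parse import unquote_to_bytes

# Two different quoted request paths (hypothetical examples).
raw_a = "/files/a%2Fb"   # one path segment named "a/b"
raw_b = "/files/a/b"     # two path segments, "a" and "b"

# What the application would see after the server has unquoted them.
decoded_a = unquote_to_bytes(raw_a)
decoded_b = unquote_to_bytes(raw_b)

# The distinction is gone; only the quoted form could recover it.
same_after_unquoting = decoded_a == decoded_b
```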
It also introduces the concept of URI encodings. I'm especially unhappy
with this part. It would mean that implementations would have to follow
the WSGI URI encoding if set. Most applications use either latin1 or
UTF-8 URLs; I would leave that, including the decoding of *all*
incoming data, to the user.
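Leaving the decoding to the user could look roughly like the sketch below. It assumes PATH_INFO arrives as a byte string, as in the bytes-everywhere variant; decode_path and its default encoding list are hypothetical and only illustrate the "UTF-8 first, latin1 as fallback" choice an application might make for itself.

```python
def decode_path(raw_path, encodings=("utf-8", "latin-1")):
    """Decode a byte-string path, trying each encoding in order."""
    for encoding in encodings:
        try:
            return raw_path.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("path could not be decoded with %r" % (encodings,))


# Hypothetical environ with a bytes PATH_INFO.
environ = {"PATH_INFO": "/café".encode("utf-8")}
path = decode_path(environ["PATH_INFO"])
```

Since latin1 can decode any byte sequence, it only makes sense as the last entry in the list; an application that wants strict behavior would pass a single encoding and handle the ValueError.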
So yes, I'm all for Definition #1 in the blog post, where Graham says:
> The first is that although WSGI 1.0 on Python 3.X should strictly be
> bytes everywhere as per Definition #1, it is probably too late to
> enforce this now.
I don't think so. Reasoning: Python 3.0 does not work and is considered
outdated. Python 3.1 might ship with a wsgiref that's against a
revised spec, but cgi.FieldStorage is still broken there, making it
impossible to use for anything but small applications.