[Web-SIG] WSGI for Python 3

Fri Aug 27 21:26:53 CEST 2010

On Fri, Aug 27, 2010 at 12:17 AM, Graham Dumpleton
<graham.dumpleton at gmail.com> wrote:
> On 27 August 2010 13:45, P.J. Eby <pje at telecommunity.com> wrote:
>> At 01:37 AM 8/27/2010 +0200, Armin Ronacher wrote:
>>>
>>> Hi,
>>>
>>> Is there a status update on that now I missed?  Did something decide on
>>> bytes for the environment values or are we still unsure about that?
>>
>> To the extent we're "unsure", I think the holdup is simply that nobody has
>> tried doing an all-bytes WSGI implementation -- unless of course you count
>> all our Python 2.x experience as experience with an all-bytes
>> implementation.  ;-)
>>
>> (Of course, that experience won't help us with Python 3 stdlib issues.)
>>
>>
>>> At that point I don't care at all about what is decided on as long as
>>> something is decided.  Can someone please stand up and just do that? :)
>>
>> Essentially the problem right now is that unless such a choice is made,
>> there's little hope of getting the stdlib issues to be resolved, because we
>> can't exactly file bug reports against the stdlib if we don't know what we
>> want it to do.  ;-)
>>
>> My personal inclination is to define WSGI 2 as a bytes-oriented protocol,
>> and then encourage people to port to WSGI 2 before moving to Python 3.
>
> Since the major stumbling block, irrespective of other changes, to any
> sort of agreement is still bytes vs unicode, and where we have a
> reasonably clear definition of what the unicode suggestion is, can we
> please as a first step get a definition of what bytes actually implies
> so everyone knows what we are talking about. I specifically ask this,
> as it isn't clear because people don't explain in detail what they
> mean when they are saying 'bytes'.
>
> Going back to my definition #2 in my blog post from a year ago, I had:
>
> 1. The application is passed an instance of a Python dictionary
> containing what is referred to as the WSGI environment. All keys in
> this dictionary are native strings. For CGI variables, all names are
> going to be ISO-8859-1 and so where native strings are unicode
> strings, that encoding is used for the names of CGI variables
>
> 2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
> environment, the value of the variable should be a native string.
>
> 3. For the CGI variables contained in the WSGI environment, the values
> of the variables are byte strings.
>
> 4. The WSGI input stream 'wsgi.input' contained in the WSGI
> environment and from which request content is read, should yield byte
> strings.
>
> 5. The status line specified by the WSGI application must be a byte string.
>
> 6. The list of response headers specified by the WSGI application must
> contain tuples consisting of two values, where each value is a byte
> string.
>
> 7. The iterable returned by the application and from which response
> content is derived, must yield byte strings.
>
> The points of disagreement I have seen about this are as follows.
>
> For (1), the keys should also be bytes, including names of 'wsgi.' special keys.
>
> For (2), the value of 'wsgi.url_scheme' should be bytes.
>
> So, do you really want bytes absolutely everywhere, or are keys still
> going to be unicode taken as ISO-8859-1?
>
> Note that we are not agreeing to the final solution here, just what
> bytes means in contrast to the unicode option, so we know that we are
> comparing only two options and not many options because people have
> different interpretations of what bytes means.
>
> As contrast, what we generally mean by the unicode option is
> definition #3 from my blog post. That being:
>
> 1. The application is passed an instance of a Python dictionary
> containing what is referred to as the WSGI environment. All keys in
> this dictionary are native strings. For CGI variables, all names are
> going to be ISO-8859-1 and so where native strings are unicode
> strings, that encoding is used for the names of CGI variables
>
> 2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
> environment, the value of the variable should be a native string.
>
> 3. For the CGI variables contained in the WSGI environment, the values
> of the variables are native strings. Where native strings are unicode
> strings, ISO-8859-1 encoding would be used such that the original
> character data is preserved and as necessary the unicode string can be
> converted back to bytes and thence decoded to unicode again using a
> different encoding.
>
> 4. The WSGI input stream 'wsgi.input' contained in the WSGI
> environment and from which request content is read, should yield byte
> strings.
>
> 5. The status line specified by the WSGI application should be a byte
> string. Where native strings are unicode strings, the native string
> type can also be returned in which case it would be encoded as
> ISO-8859-1.
>
> 6. The list of response headers specified by the WSGI application
> should contain tuples consisting of two values, where each value is a
> byte string. Where native strings are unicode strings, the native
> string type can also be returned in which case it would be encoded as
> ISO-8859-1.
>
> 7. The iterable returned by the application and from which response
> content is derived, should yield byte strings. Where native strings
> are unicode strings, the native string type can also be returned in
> which case it would be encoded as ISO-8859-1.
>
> Even though we call it unicode, it actually has bytes in places as well.
> The key issue over bytes vs unicode has been the values in the
> dictionary, but as pointed out above, it is not clear whether, for the
> bytes option, we are talking about bytes for keys as well and for the
> value of 'wsgi.url_scheme'.
>
> So, can we clarify this first. And if you are going to comment,
> for that extra clarity, cut and paste my definition #2 above and make
> the changes to it so we have the full definition, rather than just
> referring to bits. That way people who come and read this don't have
> to trawl through the whole email chain to derive the context.
>
> Once we get that clarification, then we can perhaps discuss
> exclusively any issues people have with that bytes definition. That is
> before we even try to balance it against the unicode option or look at
> other WSGI 2 changes such as dropping start_response and
> wsgi.file_wrapper.
>
> And I apologise in advance if I start getting cranky and people think
> I am trying to hijack the conversation. I want a solution more so than
> probably anyone else as I can't fix up mod_wsgi until there is and
> right now I am feeling pretty unmotivated towards doing anything with
> mod_wsgi at all, even non Python 3.X enhancements because of all this.
> So, if we can keep focus and try going one step at a time, maybe I
> will not go ballistic. ;-)
>
> Graham

I ran into this while I was attempting to put together enough code to
play with a wsgiref2 that ran on both 2.x and 3.x. As Graham has
deftly pointed out, it's a pretty big pain in the rear.

Specifically, if we specify that all keys in the environ dictionary
are byte strings, then there's a noticeable amount of pain in trying
to write code that runs on both platforms. I object to 2to3.py on
religious grounds, so when I was implementing this I was doing so with
code that would run unmodified on both 2 and 3.

What I ran into is that if you want to support versions older than 2.6,
all environ key lookups must be wrapped with a helper function. This
makes code that uses the dict full of things like
environ[b("wsgi.errors")].write(b("some message")) where b is a helper
I wrote to convert to the right type for a given interpreter. And I'm
still not sure how Jython works with strings. PEP 333 says it's unicode
only, which makes me wonder how they would react to the bytes
everywhere approach.
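
For illustration, a helper along these lines might look like the sketch
below. It is an assumption about the general shape of such a function,
not the actual code from wsgiref2, and the choice of latin-1 for the
conversion is also an assumption.

```python
import sys

if sys.version_info[0] >= 3:
    def b(s):
        # Python 3: str is unicode, so encode it to a byte string.
        # latin-1 maps code points 0-255 one-to-one onto byte values.
        return s.encode("latin-1")
else:
    def b(s):
        # Python 2: a plain str literal is already a byte string.
        return s
```

With a helper like this, b("wsgi.errors") yields a byte string on either
interpreter without needing the b"..." literal syntax, which is not
available before 2.6.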

I'm also not a big fan of automatically applying a default encoding to
*any* of the bytes read in an HTTP request. After contemplating for
a while I came to the conclusion that header names are really part of
the request itself, whereas the other keys in the environ are
metadata about the request. Having the two different types of data in
the same space domain seemed to be the root of the problem. So I
rearranged things so that there's an "http.headers" key that is a
dictionary with byte strings for keys and values.
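
A minimal sketch of that layout follows. The function names
parse_headers and build_environ are hypothetical, and lowercasing the
header names and keeping only the last value of a repeated header are
simplifying assumptions, not something taken from wsgiref2.

```python
def parse_headers(raw_lines):
    """Turn raw header lines (byte strings) into a bytes-to-bytes dict."""
    # Build the separator without a b"..." literal so this also runs on 2.5.
    colon = ":".encode("latin-1")
    headers = {}
    for line in raw_lines:
        name, _, value = line.partition(colon)
        # Assumption: names are lowercased; repeated headers overwrite.
        headers[name.strip().lower()] = value.strip()
    return headers


def build_environ(raw_lines, url_scheme="http"):
    # Metadata keys stay native strings; request bytes are never decoded.
    return {
        "wsgi.url_scheme": url_scheme,
        "http.headers": parse_headers(raw_lines),
    }
```

The metadata keys can then be looked up with ordinary string literals,
and only code that touches "http.headers" has to deal with byte strings.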

I haven't managed to find any time to write a test suite for the spec
I was toying with, but I figure it's far enough along that it might be
interesting to someone. This code should be runnable on 2.5, 2.6 and
3.2. When I get back to working on it, my next goal is to figure out
a way to write the test suite so that it can run against any
implementation to test for compliance.

Code is at: http://github.com/davisp/wsgiref2

Paul Davis