[Web-SIG] WSGI for Python 3

Fri Aug 27 19:01:56 CEST 2010

At 02:17 PM 8/27/2010 +1000, Graham Dumpleton wrote:
>Since the major stumbling block, irrespective of other changes, to any
>sort of agreement is still bytes vs unicode, and since we have a
>reasonably clear definition of what the unicode suggestion is, can we
>please, as a first step, get a definition of what bytes actually implies
>so everyone knows what we are talking about. I specifically ask this,
>as it isn't clear because people don't explain in detail what they
>mean when they are saying 'bytes'.
>
>Going back to my definition #2 in my blog post from a year ago, I had:
>
>1. The application is passed an instance of a Python dictionary
>containing what is referred to as the WSGI environment. All keys in
>this dictionary are native strings. For CGI variables, all names are
>going to be ISO-8859-1 and so where native strings are unicode
>strings, that encoding is used for the names of CGI variables

FYI, one thing that's changed here is the existence of os.environb in 
Python 3.2, at least on non-Windows OSes.
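
As a rough illustration of what os.environb offers, the sketch below builds a bytes-keyed copy of the process environment. The helper name environ_bytes_view is made up for this example, and the fallback branch is an assumption about how one might cope on platforms where os.environb is absent.

```python
import os

def environ_bytes_view():
    """Return a dict copy of the environment with bytes keys and values."""
    if os.supports_bytes_environ:
        # os.environb is available (Python 3.2+, non-Windows): already bytes.
        return dict(os.environb)
    # Assumed fallback: encode the str view with the filesystem encoding.
    return {os.fsencode(k): os.fsencode(v) for k, v in os.environ.items()}

env = environ_bytes_view()
```

Note that both the keys and the values come back as bytes, which is why following this model would imply bytes keys in the WSGI environment too.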

>2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
>environment, the value of the variable should be a native string.

Since any meaningful use of this value is going to end up needing to 
be bytes again (e.g. Location headers), and for consistency's sake, I 
lean towards saying this is bytes too.

>3. For the CGI variables contained in the WSGI environment, the values
>of the variables are byte strings.
>
>4. The WSGI input stream 'wsgi.input' contained in the WSGI
>environment and from which request content is read, should yield byte
>strings.
>
>5. The status line specified by the WSGI application must be a byte string.
>
>6. The list of response headers specified by the WSGI application must
>contain tuples consisting of two values, where each value is a byte
>string.
>
>7. The iterable returned by the application and from which response
>content is derived, must yield byte strings.
>
>The points of disagreement I have seen about this are as follows.
>
>For (1), the keys should also be bytes, including names of 'wsgi.' 
>special keys.
>
>For (2), the value of 'wsgi.url_scheme' should be bytes.
>
>So, do you really want bytes absolutely everywhere, or are keys still
>going to be unicode taken as ISO-8859-1?

If we follow the example of os.environb, then the keys have to be bytes also.

However, I can already see that the big problem with all of this is 
that WSGI code is going to be littered with a plague of "b"s hanging 
off the front of every string literal, and that 2to3 is probably not 
going to handle it correctly.  Making the keys bytes as well just 
multiplies the problem.
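
To make the concern concrete, here is a minimal sketch of an application written against the hypothetical bytes-everywhere variant (bytes keys, bytes values, bytes outputs). It is not based on any published WSGI spec; the name redirect_app and the environ contents are assumptions for illustration only.

```python
def redirect_app(environ, start_response):
    # Hypothetical bytes-everywhere environ: every key and value is bytes.
    method = environ[b'REQUEST_METHOD']
    scheme = environ[b'wsgi.url_scheme']
    host = environ[b'HTTP_HOST']
    if method != b'GET':
        start_response(b'405 Method Not Allowed', [(b'Allow', b'GET')])
        return [b'']
    # The scheme ends up in a header value, so it has to be bytes here anyway.
    location = scheme + b'://' + host + b'/'
    start_response(b'302 Found', [(b'Location', location)])
    return [b'']
```

Every literal in the function carries a "b" prefix, and dropping any one of them changes a lookup or comparison from matching to silently not matching.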

>Note that we are not agreeing to the final solution here, just what
>bytes means in contrast to the unicode option, so we know that we are
>comparing only two options and not many options because people have
>different interpretations of what bytes means.
>
>As contrast, what we generally mean by the unicode option is
>definition #3 from my blog post. That being:
>
>1. The application is passed an instance of a Python dictionary
>containing what is referred to as the WSGI environment. All keys in
>this dictionary are native strings. For CGI variables, all names are
>going to be ISO-8859-1 and so where native strings are unicode
>strings, that encoding is used for the names of CGI variables
>
>2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
>environment, the value of the variable should be a native string.
>
>3. For the CGI variables contained in the WSGI environment, the values
>of the variables are native strings. Where native strings are unicode
>strings, ISO-8859-1 encoding would be used such that the original
>character data is preserved and as necessary the unicode string can be
>converted back to bytes and thence decoded to unicode again using a
>different encoding.
>
>4. The WSGI input stream 'wsgi.input' contained in the WSGI
>environment and from which request content is read, should yield byte
>strings.
>
>5. The status line specified by the WSGI application should be a byte
>string. Where native strings are unicode strings, the native string
>type can also be returned in which case it would be encoded as
>ISO-8859-1.
>
>6. The list of response headers specified by the WSGI application
>should contain tuples consisting of two values, where each value is a
>byte string. Where native strings are unicode strings, the native
>string type can also be returned in which case it would be encoded as
>ISO-8859-1.
>
>7. The iterable returned by the application and from which response
>content is derived, should yield byte strings. Where native strings
>are unicode strings, the native string type can also be returned in
>which case it would be encoded as ISO-8859-1.
>
>Even though we call it unicode, it actually has bytes in places as well.
>The key issue over bytes vs unicode has been the values in the
>dictionary, but as pointed out above, it is not clear whether, for the
>bytes option, we are talking about bytes for keys as well and for the
>value of 'wsgi.url_scheme'.
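
Point 3 of the unicode option relies on ISO-8859-1 mapping each byte value to the code point with the same number, so decoding is lossless and reversible. A small sketch of the round trip follows; the helper name redecode and the sample path are invented for this example.

```python
def redecode(environ_value, encoding="utf-8"):
    """Recover the original bytes from a latin-1 decoded value, then decode
    them with the encoding the application actually expects."""
    return environ_value.encode("iso-8859-1").decode(encoding)

raw = "/caf\u00e9".encode("utf-8")           # bytes as sent by the client
environ_value = raw.decode("iso-8859-1")     # what the server would hand over
path = redecode(environ_value)               # the application's own decoding
```

The intermediate environ_value looks like mojibake if printed, but no information has been lost: encoding it back to ISO-8859-1 yields exactly the bytes that arrived.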

The main issue I have with this option is that it seems to make it 
trivially easy to write an app or piece of middleware that appears to 
work correctly most of the time, but fails when placed in a particular 
combination with other apps or middleware.

More precisely, an updated wsgiref.validate module used to check the 
"unicode option" would mark such apps and middleware as perfectly 
spec-conformant, yet this spec-conformance would not be transitive - 
i.e., you couldn't say that an assembly of spec-conformant middleware 
and apps would be correct.

Hmmm...  unless...  I guess the only way to be really sure would be 
if the validation process randomly switched the types of input and 
output values between the forms allowed by the spec, and verified that 
the results were still compliant.  ;-)

(In practice, I expect that getting it to do that would be rather 
difficult, though.)

Let me see if I can more precisely narrow down my concern.

Mostly, it boils down to the possibility of non-latin1 unicode 
"escaping" into the output stream...  so if #5, #6 and #7 above were 
changed to bytes-only outputs, then an updated validator can enforce 
those criteria, making spec-compliance verification 
composable.  (That is, if you combine two things that are verified 
compliant, the combination is also known to be compliant.)
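
A minimal sketch of what such a bytes-only output check could look like as middleware. This is not the actual wsgiref.validate code; the name bytes_only_validator is hypothetical, and the sketch only covers the status, header and body types from #5, #6 and #7.

```python
def bytes_only_validator(app):
    """Wrap `app`; raise TypeError if the status, a header, or a body chunk
    is anything other than bytes."""
    def check(value, what):
        if not isinstance(value, bytes):
            raise TypeError("%s must be bytes, got %s"
                            % (what, type(value).__name__))

    def wrapper(environ, start_response):
        def checked_start_response(status, headers, exc_info=None):
            check(status, "status")
            for name, value in headers:
                check(name, "header name")
                check(value, "header value")
            if exc_info is None:
                return start_response(status, headers)
            return start_response(status, headers, exc_info)

        def checked_iter(result):
            try:
                for chunk in result:
                    check(chunk, "body chunk")
                    yield chunk
            finally:
                # Honour the optional close() method on the app's iterable.
                close = getattr(result, "close", None)
                if close is not None:
                    close()

        return checked_iter(app(environ, checked_start_response))
    return wrapper
```

Because there is exactly one accepted output type, anything that passes this check also passes when wrapped a second time or stacked under other checked middleware, which is the composability property described above.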

So, I could actually support a format that was "unicode (latin1) 
headers in, bytes headers out", and "bytes stream in, bytes stream out".

You can then concentrate all your encoding or decoding operations at 
one place, or even write a decorator to take care of it for you.
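
One possible shape for such a decorator, as a sketch only: the application body uses native strings for the status and headers, and the decorator encodes them to bytes as ISO-8859-1 before they reach the server. The name encode_outputs is made up here, and leaving the body iterable untouched is an assumption that follows the "bytes stream out" half of the proposal.

```python
import functools

def encode_outputs(app):
    """Let `app` pass str status/headers; hand bytes to the real
    start_response, encoding as ISO-8859-1 in this one place."""
    def to_bytes(value):
        return value.encode("iso-8859-1") if isinstance(value, str) else value

    @functools.wraps(app)
    def wrapper(environ, start_response):
        def encoding_start_response(status, headers, exc_info=None):
            status = to_bytes(status)
            headers = [(to_bytes(n), to_bytes(v)) for n, v in headers]
            if exc_info is None:
                return start_response(status, headers)
            return start_response(status, headers, exc_info)
        # The body is passed through as-is; it is expected to be bytes already.
        return app(environ, encoding_start_response)
    return wrapper

@encode_outputs
def hello_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello\n']
```

A header containing a character outside latin-1 raises UnicodeEncodeError inside the decorator, so the failure shows up at the boundary of the app that produced it rather than somewhere downstream.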

>So, can we clarify this first? And if you are going to comment,
>for that extra clarity, cut and paste my definition #2 above and make
>the changes to it so we have the full definition, rather than just
>referring to bits. That way people who come and read this don't have
>to trawl through the whole email chain to derive the context.
>
>Once we get that clarification, then we can perhaps discuss
>exclusively any issues people have with that bytes definition. That is
>before we even try to balance it against the unicode option or look at
>other WSGI 2 changes such as dropping start_response and
>wsgi.file_wrapper.
>
>And I apologise in advance if I start getting cranky and people think
>I am trying to hijack the conversation. I want a solution more so than
>probably anyone else, as I can't fix up mod_wsgi until there is one, and
>right now I am feeling pretty unmotivated towards doing anything with
>mod_wsgi at all, even non-Python 3.X enhancements, because of all this.
>So, if we can keep focus and try going one step at a time, maybe I
>will not go ballistic. ;-)

Thanks for hanging in there, and also for posting this summary!