[Web-SIG] Proposal to remove SCRIPT_NAME/PATH_INFO

Wed Sep 23 05:26:15 CEST 2009

At 09:22 PM 9/22/2009 -0500, Ian Bicking wrote:
>OK, I mentioned this in the last thread, but... I can't keep up with 
>all this discussion, and I bet you can't either.
>
>So, here's a rough proposal for WSGI and unicode:
>
>I propose we switch primarily to "native" strings: str on both Python 2 and 3.

+1, if you mean the strings have the same content, 
character-for-character, on Python 2 and 3.  That is, a \x80 byte in a 
Python 2 'str' is matched by an \x80 character in the Python 3 
'str'.  (I presume that's what we mean by "native", but I want to be sure.)
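
To make the "character-for-character" condition concrete, here is a small
Python 3 sketch.  It assumes a server would build native strings by decoding
raw bytes as Latin-1; the helper name to_native is hypothetical and is not
part of PEP 333.

```python
def to_native(raw):
    """Decode raw bytes so that byte N becomes the character with code point N."""
    # Latin-1 maps bytes 0-255 one-to-one onto code points 0-255,
    # so no byte is lost or altered by this decode.
    return raw.decode('latin-1')


raw = b'caf\x80'
native = to_native(raw)

# Every character has the same ordinal as the byte it came from.
assert [ord(ch) for ch in native] == list(raw)
# The round trip back to bytes is lossless.
assert native.encode('latin-1') == raw
```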

>Specifically:
>environ keys: native
>environ CGI values: native
>wsgi.* (that is text): native
>response status: native
>response headers: native

+1 all of the above, again per the previous caveat.  IOW, if all this 
stuff is still exactly as laid out in PEP 333 to start with.  (Minor 
issue: the CGI environment vars will need transcoding on Python 3, if 
the system encoding is not latin 1.)
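
A rough sketch of that transcoding step on Python 3 follows.  It assumes the
interpreter decoded the environment with the system encoding and the
surrogateescape error handler (as POSIX builds do per PEP 383); the function
name and its system_encoding parameter are hypothetical.

```python
def transcode_cgi_value(value, system_encoding):
    """Re-express a system-decoded environment value as a Latin-1 native string."""
    # Recover the original bytes; surrogateescape restores any bytes that
    # could not be decoded under the system encoding.
    raw = value.encode(system_encoding, 'surrogateescape')
    # Map each byte to the character with the same code point.
    return raw.decode('latin-1')


# A path that arrived as UTF-8 bytes and was decoded by a UTF-8 system.
decoded_by_system = b'/caf\xc3\xa9'.decode('utf-8', 'surrogateescape')
native = transcode_cgi_value(decoded_by_system, 'utf-8')
assert native == '/caf\xc3\xa9'
```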

>wsgi.input remains byte-oriented, as does the response app_iter.

+1, acceptable as errata of the original PEP with respect to Python 3, as long 
as we also allow the app to yield "native" strings (Unicode chars 0-255) on Python 3.

>I then propose that we eliminate SCRIPT_NAME and PATH_INFO.  Instead we have:
>
>wsgi.script_name
>wsgi.path_info (I'm not entirely set on these names)
>
>These both form the original path.  It is not URL decoded, so it 
>should be ASCII.  (I believe non-ASCII could be rejected by the 
>server, with Bad Request?  A server could also choose to treat it 
>as UTF8 or Latin1 and encode unsafe characters to make it 
>ASCII.)  Thus to re-form the URL, you do:
>
>environ['wsgi.url_scheme'] + '://' + environ['HTTP_HOST'] + 
>environ['wsgi.script_name'] + environ['wsgi.path_info'] + '?' + 
>environ['QUERY_STRING']

I'm not clear how all the above is going to work, since I had the 
impression from the comments of server specialists here (e.g. Graham, 
Robert and Alan) that you simply can't do this 
correctly/consistently, especially under Java.  I think a more 
detailed proposal is needed here, along with a brief rationale for 
why we need them to be specially named like this.
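
For concreteness, the reconstruction in the quoted proposal might look like
the sketch below.  The keys wsgi.script_name and wsgi.path_info are only the
proposed names, not existing WSGI keys, and the sketch assumes both values
are already URL-encoded ASCII.  Unlike the quoted expression, it appends '?'
only when a query string is present.

```python
def reconstruct_url(environ):
    """Rebuild the request URL from the proposed (hypothetical) environ keys."""
    url = environ['wsgi.url_scheme'] + '://' + environ['HTTP_HOST']
    # Both parts are assumed to be undecoded, so they are joined as-is.
    url += environ['wsgi.script_name'] + environ['wsgi.path_info']
    query = environ.get('QUERY_STRING', '')
    if query:
        url += '?' + query
    return url


example_environ = {
    'wsgi.url_scheme': 'http',
    'HTTP_HOST': 'example.com',
    'wsgi.script_name': '/app',
    'wsgi.path_info': '/caf%C3%A9',
    'QUERY_STRING': 'q=1',
}
assert reconstruct_url(example_environ) == 'http://example.com/app/caf%C3%A9?q=1'
```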

>All incoming headers will be treated as Latin1.  If an application 
>suspects another encoding, it is up to the application to transcode 
>the header into another encoding.

+1.

>The transcoded value should not be put into the environ.

Make that a MUST NOT, and I'm good.  ;-)

>In most cases headers should be ASCII, and Latin1 is simply a 
>fallback that allows all bytes to be represented in both Python 2 and 3.

Hurray!

>Similarly all outgoing headers will be Latin1.  Thus if you 
>(against good sense) decide to put UTF8 into a cookie, you can do:
>
>headers.append(('Set-Cookie', unicode_text.encode('UTF8').decode('latin1')))
>
>The server will then encode the text as latin1, sending the UTF8 
>bytes.  This is lame, but non-ASCII in headers is lame.  It would 
>be preferable to do:
>
>headers.append(('Set-Cookie', urllib.quote(unicode_text.encode('UTF8'))))
>
>This sends different text, but is highly preferable.  If you wanted 
>to parse a cookie that was set as UTF8, you'd do:
>
>parse_cookie(environ['HTTP_COOKIE'].encode('latin1').decode('utf8'))
>
>Again, it would be better to do:
>
>parse_cookie(urllib.unquote(environ['HTTP_COOKIE']).decode('utf8'))

Looking good.
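
The quoted snippets are written for Python 2.  As a sketch of how the same
two approaches could look on Python 3, assuming native strings are Latin-1
decoded bytes as proposed (the helper names here are hypothetical):

```python
from urllib.parse import quote, unquote


def utf8_as_latin1(text):
    """Return a native string whose Latin-1 encoding is the UTF-8 form of text."""
    return text.encode('utf-8').decode('latin-1')


def latin1_as_utf8(native):
    """Undo utf8_as_latin1 on a value read back from the environ."""
    return native.encode('latin-1').decode('utf-8')


text = 'na\u00efve \u2603'

# Approach 1: UTF-8 bytes carried inside a Latin-1 native string.
header_value = utf8_as_latin1(text)
assert header_value.encode('latin-1') == text.encode('utf-8')  # bytes on the wire
assert latin1_as_utf8(header_value) == text

# Approach 2: percent-encoding keeps the header pure ASCII.
quoted = quote(text)  # quote() encodes str input as UTF-8 by default
assert all(ord(ch) < 128 for ch in quoted)
assert unquote(quoted) == text
```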

>Other variables like environ['wsgi.url_scheme'], 
>environ['CONTENT_TYPE'], etc, will be native strings.  A Python 3 
>hello world app will then look like:
>
>def hello_world(environ):
>    return ('200 OK', [('Content-type', 'text/html; charset=utf8')],
>            ['Hello World!'.encode('utf8')])
>
>start_response and changes to wsgi.input are incidental to what I'm 
>proposing here (except that wsgi.input will be bytes); we can decide 
>about them separately.

More +1 goodness from me.

>Outstanding issues:
>
>Well, the biggie: is it right to use native strings for the environ 
>values, and response status/headers?  Specifically, tricks like the 
>latin1 transcoding won't work in Python 2, but will in Python 
>3.  Is this weird?  Or just something you have to think about when 
>using the two Python versions?

If your app is written to spec and you're *not* using unicode, 
nothing changes.  But if your app is written to spec and you *are* 
using unicode, then you're currently calling .decode() on anything 
you read from environ or input, in order to get to unicode.  That 
will break on 3.x, since str lacks a .decode() method.  So, you will 
at least fail every time with an error message, and fix it, unless 
somebody writes a 2to3 fixer that can infer a "wsgi native" type and 
do the right thing.  I don't know if such a thing is possible.

However, it also seems to me that it's pretty unlikely anybody is 
doing this decoding in multiple places in their code, so it seems 
unlikely to be very painful for very long.  Libraries tend to have 
request objects, or functions to do this kind of thing, so only those 
should have to change in 3.x.
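
As an illustration of keeping that decoding in one place, a library might
wrap it in a single helper along these lines.  This is only a sketch: the
name environ_text is hypothetical, and it assumes Python 2.6 or later, where
bytes is an alias for str.

```python
def environ_text(value, encoding='utf-8'):
    """Turn a native environ string into text under the given encoding."""
    if isinstance(value, bytes):
        # Python 2: the native str is already a byte string.
        return value.decode(encoding)
    # Python 3: the native str holds one character per original byte,
    # so go back to bytes via Latin-1 before decoding.
    return value.encode('latin-1').decode(encoding)


# The same UTF-8 path segment, as each Python version would present it.
assert environ_text(b'caf\xc3\xa9') == 'caf\u00e9'
assert environ_text('caf\xc3\xa9') == 'caf\u00e9'
```

Application code would then call the helper rather than .decode() directly,
so only the helper differs in behavior between the two versions.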

>What happens if you give unicode text in the response headers that 
>cannot be encoded as Latin1?

Then something's going to fail, and unfortunately it's going to take a while.
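
One way a server could surface that failure early is sketched below.  Nothing
in the proposal specifies this; the function name and the choice of
ValueError are assumptions made for illustration.

```python
def encode_header_value(value):
    """Encode a native header value for the wire, failing loudly if impossible."""
    try:
        return value.encode('latin-1')
    except UnicodeEncodeError as exc:
        # Characters above U+00FF have no Latin-1 byte, so they cannot be sent.
        raise ValueError('header value is not Latin-1 encodable: %r' % value) from exc


assert encode_header_value('text/html') == b'text/html'

try:
    encode_header_value('price: \u20ac5')
except ValueError:
    rejected = True
else:
    rejected = False
assert rejected
```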

>Should some things specifically be ASCII?  E.g., status.

+1.

>Should some things be unicode on Python 2?

Not that I can think of.  To be honest, this whole thread has made me 
appreciate the Python 2 str type more than ever.  ;-)

>Is there a common case here that would be inefficient?

Beats me.