[Web-SIG] WSGI for Python 3

Sat Jul 17 07:51:21 CEST 2010

On Saturday, July 17, 2010, Ian Bicking <ianb at colorstudy.com> wrote:
> On Fri, Jul 16, 2010 at 4:33 AM, And Clover <and-py at doxdesk.com> wrote:
>
>
> On 07/14/2010 06:43 AM, Ian Bicking wrote:
>
>
> There's only a couple tricky keys: SCRIPT_NAME, PATH_INFO,
> and HTTP_COOKIE.
>
> (And of those, PATH_INFO is the only one that really matters, in that no-one really uses non-ASCII script filenames, and non-ASCII characters in Cookie/Set-Cookie are still handled so differently/brokenly across browsers that you can't rely on them at all.)
>
> * I (re)propose we eliminate SCRIPT_NAME and PATH_INFO and replace them
> exclusively with encoded versions
>
> For compatibility with existing apps, how about keeping the existing SCRIPT_NAME and PATH_INFO as-is (with all their problems), and specifying that the new 'raw' versions (whatever they are called) are added only if they really are raw, not reconstructed.
>
> Having two ways of expressing the same information will lead to bugs related to which data is canonical.  If an application is using SCRIPT_NAME/PATH_INFO and then updates those values in any way, and wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be weird bugs and code will disagree about which one is correct.  Since %2f can exist in the raw versions, there isn't even a way to chunk the two variables in the same way.
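
To make the %2f point concrete, here is a minimal sketch (the path value is made up for illustration, and this is not code from any server). It shows that splitting a raw path into segments and splitting the decoded path give different chunks:

```python
from urllib.parse import unquote

# Hypothetical raw request path containing an encoded slash.
raw_path = "/files/a%2Fb/readme.txt"

# Split the raw path first, then decode each segment:
# the encoded slash stays inside a single segment.
raw_segments = [unquote(seg) for seg in raw_path.split("/")[1:]]

# Decode first (roughly what a CGI-style PATH_INFO gives you), then split:
# the encoded slash has become an ordinary separator.
decoded_segments = unquote(raw_path).split("/")[1:]

print(raw_segments)      # ['files', 'a/b', 'readme.txt']
print(decoded_segments)  # ['files', 'a', 'b', 'readme.txt']
```

Once the decoded form has been split, there is no way to tell which slashes were originally %2F, so the two representations cannot be kept in sync segment by segment.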
>
>
> Then existing scripts that don't care about non-ASCII and slashes can carry on as before, and for apps that do care about them, they'll be able to be *sure* the input is correct. Or they can fall back to PATH_INFO when not present, and avoid producing these kinds of URLs in response.
>
> I don't think it works to imagine you can just not care about non-ASCII.  Requests come in.  WSGI should represent those requests.  If a request comes in with non-ASCII bytes then WSGI needs to do *something* with it.  I don't want to have to configure servers with application policy; servers should just work.
>
> And this doesn't help with Python 3: either we have byte values of SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think bytes will be more awkward to port to than text, and inconsistent with other WSGI values.  If we have text then we have to choose an encoding.  Latin1 will work, but it will be the exact wrong encoding most of the time, as UTF-8 is the typical encoding (unlike other headers, where Latin1 will mostly be an okay encoding, or as good a guess as we have).  If we firmly remove these keys then we can avoid this choice entirely... and we conveniently also get a better representation of the request.
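
As a small illustration of the Latin1 option discussed above (a sketch only; the path value is made up), decoding as Latin1 never fails and loses no bytes, but a UTF-8 path comes out as the wrong text until the application transcodes it:

```python
# Bytes a server might hold for a UTF-8 encoded path after %-decoding.
wire_bytes = "/caf\u00e9".encode("utf-8")        # b'/caf\xc3\xa9'

# Latin1 maps every byte to one code point, so this cannot raise,
# but the result is not the text the client meant.
as_latin1 = wire_bytes.decode("latin-1")         # '/cafÃ©'

# The application can get back to the bytes and decode them properly.
recovered = as_latin1.encode("latin-1").decode("utf-8")   # '/café'

print(repr(as_latin1), repr(recovered))
```

The round trip is lossless, which is why Latin1 "works", but every application that cares about the real text has to remember to do the second step.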

One reason I don't want to see the existing keys removed is for
debugging purposes. In Apache, various modules such as mod_rewrite
operate on the translated path. I am concerned that if only the raw
path is available in the WSGI application, confusion may arise when
something doesn't go right with rewrites, because the only
information an application can dump for debugging will differ from
what the other Apache modules operated on. If you aren't going to
make use of the CGI versions, then I would still like to see them
present, but perhaps renamed. That way you don't lose information
when trying to debug things. I could perhaps just put this in an
Apache/mod_wsgi-specific key as well, given that the issue is
particular to it. Thus we might have apache.path_info or
cgi.path_info.
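
A rough sketch of the kind of debug dump this would allow. The key names other than SCRIPT_NAME and PATH_INFO are only the ones floated in this thread, not part of any agreed specification, and the example values are invented:

```python
# Candidate path-related keys. Everything except SCRIPT_NAME and PATH_INFO
# is hypothetical: these are just names suggested in this discussion.
PATH_KEYS = (
    "SCRIPT_NAME",
    "PATH_INFO",
    "wsgi.raw_script_name",
    "wsgi.raw_path_info",
    "apache.path_info",
    "cgi.path_info",
)

def path_debug_info(environ):
    """Return whichever path-related keys are present, for logging."""
    return {key: environ[key] for key in PATH_KEYS if key in environ}

# Invented values: the raw path as sent by the client, and the path
# after the server's own translation and rewriting.
environ = {
    "wsgi.raw_path_info": "/old%2Fname",
    "apache.path_info": "/new/name",
}
for key, value in path_debug_info(environ).items():
    print(key, "=", value)
```

With both values in the dump, a mismatch between what the client sent and what the rewrite rules acted on is visible at a glance.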

Graham

> Note that libraries can smooth over this change; WebOb for instance will certainly still support req.script_name/req.path_info by decoding the raw values.  Admittedly lots of code uses these values directly... but at least if they get a KeyError the port/fix will be obvious (as opposed to out of sync values, which will only emerge as a problem occasionally -- I'd rather not invite more occasional bugs).
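
A library-level shim along those lines might look roughly like this. It is a sketch under assumptions, not WebOb's actual code: the function name is made up, wsgi.raw_path_info is only the key name floated in this thread, and UTF-8 is assumed as the default encoding:

```python
from urllib.parse import unquote_to_bytes

def decoded_path_info(environ, encoding="utf-8"):
    """Decode the (hypothetical) raw path key into text.

    Raises KeyError if the raw key is absent, so unported code fails
    loudly instead of silently reading a stale value.
    """
    raw = environ["wsgi.raw_path_info"]
    # Undo %-escapes to bytes first, then decode with the chosen encoding.
    return unquote_to_bytes(raw).decode(encoding, "replace")

print(decoded_path_info({"wsgi.raw_path_info": "/caf%C3%A9"}))
```

Note that decoding this way also turns %2F into a plain slash, so code that needs to tell the two apart should split the raw value into segments before decoding.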
>
> --
> Ian Bicking  |  http://blog.ianbicking.org
>