[Web-SIG] WSGI for Python 3

Fri Jul 16 21:17:36 CEST 2010

Hello,

Ian said:
> Having two ways of expressing the same information will lead to bugs
> related to which data is canonical.  If an application is using
> SCRIPT_NAME/PATH_INFO and then updates those values in any way, and
> wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be
> weird bugs and code will disagree about which one is correct.  Since %2f
> can exist in the raw versions, there isn't even a way to chunk the two
> variables in the same way.

I can't agree more.

I would propose the following, and excuse me in advance if this has already 
been proposed and discarded -- I tried to follow this topic on the mailing 
list over the past few months, until it became an endless discussion.

I think only the raw values should be available. Even if a middleware changes 
them, it must write them back in raw (still encoded) form. And because you 
cannot change those values without knowing what encoding the request uses, 
the character encoding *must* be present.

I know that sounds easy but it's not, because browsers don't specify the 
charset in the Content-Type and instead they generate a new request using the 
charset from the previous response. So the charset is unknown to the 
server/gateway and the middleware stack.

So, what we could do is introduce a mandatory variable called, say, 
wsgi.charset, which would be used as follows:
 - It MUST be set by the server or gateway on every request.
 - Every middleware or application that reads or writes these values MUST use 
the charset specified in wsgi.charset.
 - If a server, gateway, middleware or application wants to change the charset 
and it is possible*, it MUST convert the *entire* request into that charset 
and update wsgi.charset accordingly.
 - When the charset is not specified in the HTTP request, UTF-8 MUST be 
assumed by the server/gateway, unless another default charset has been 
specified by the user.
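
As a rough sketch of the first and last rules, a server or gateway could fill 
in wsgi.charset like this. The function names and the `default` parameter are 
made up for illustration; only the wsgi.charset key comes from the proposal, 
and the Content-Type parsing leans on the standard library's email package.

```python
from email.message import Message


def detect_charset(environ, default="utf-8"):
    """Return the charset named in the request's Content-Type, else `default`."""
    content_type = environ.get("CONTENT_TYPE", "")
    if content_type:
        # Reuse the stdlib header parser to extract the charset parameter.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # Nothing in the request: fall back to UTF-8 or the user's configured default.
    return default


def set_wsgi_charset(environ, default="utf-8"):
    """Hypothetical server-side hook: always set wsgi.charset before dispatch."""
    environ["wsgi.charset"] = detect_charset(environ, default)
    return environ
```

A gateway would call set_wsgi_charset(environ) once per request, before any 
middleware sees the environ.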

I think/hope that will solve all the problems.

What happens when a WSGI application is actually made up of two WSGI 
applications and they send their responses in different charsets? If it's not 
possible to configure them so that they both use the same charset, then one of 
them would have to be wrapped by a middleware which:
 - On egress, converts the responses using the charset used by the other 
application.
 - On ingress, if the charset is not specified in the request, assumes it's 
the one used by the other application, and thus converts the request to the 
charset supported by the wrapped application.

It would look like this:
===
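import trac.web.main

def application(environ, start_response):
    # Dispatch on the request path (PATH_INFO is the standard WSGI key).
    # CharsetNormalizer is the hypothetical middleware described above and
    # myapp stands for any other WSGI application.
    if environ.get("PATH_INFO", "").startswith("/trac/"):
        # Say Trac only supports Latin-1 and we want responses to use UTF-8:
        app = trac.web.main.dispatch_request
        app = CharsetNormalizer(app, response="latin-1", request="utf8")
    else:
        # myapp uses UTF-8
        app = myapp
    return app(environ, start_response)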
===
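
For what it's worth, here is a minimal sketch of what the egress half of such 
a CharsetNormalizer might look like. It is only an illustration under 
assumptions: I take `response` to mean the charset the wrapped application 
emits and `request` to mean the charset used outside the wrapper. It does not 
rewrite the charset parameter of Content-Type, does not check that the body is 
text, and leaves the ingress conversion out entirely.

```python
import codecs


class CharsetNormalizer:
    """Hypothetical middleware; only the egress (response) side is sketched."""

    def __init__(self, app, response, request):
        self.app = app
        self.response = response  # charset the wrapped app emits (assumed meaning)
        self.request = request    # charset used outside the wrapper (assumed meaning)

    def __call__(self, environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # The byte length changes after transcoding, so drop Content-Length.
            headers = [(k, v) for k, v in headers if k.lower() != "content-length"]
            return start_response(status, headers, exc_info)

        body = self.app(environ, patched_start_response)
        return self._transcode(body)

    def _transcode(self, body):
        # An incremental decoder copes with multi-byte characters split
        # across chunk boundaries.
        decoder = codecs.getincrementaldecoder(self.response)()
        try:
            for chunk in body:
                text = decoder.decode(chunk)
                if text:
                    yield text.encode(self.request)
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail.encode(self.request)
        finally:
            # WSGI requires close() to be called on the app's iterable if present.
            if hasattr(body, "close"):
                body.close()
```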

Then there's the string vs bytes issue. Bytes would be the natural choice to 
represent these raw values, but they would probably cause more trouble than 
they solve. So, I think they should be strings that contain the ASCII raw 
encoded values (i.e., str on both versions of Python).
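
To illustrate what I mean by ASCII raw encoded strings, here is a small 
sketch using urllib.parse from Python 3. The two helper names are invented 
for the example, and the charset argument stands in for whatever wsgi.charset 
would hold. Note that the path is split on literal slashes before decoding, 
so an encoded %2F stays inside its segment.

```python
from urllib.parse import quote, unquote


def decode_raw_path(raw_path, charset):
    """Split on literal '/' first, then percent-decode each segment."""
    return [unquote(segment, encoding=charset) for segment in raw_path.split("/")]


def encode_raw_path(segments, charset):
    """Percent-encode each segment (including '/') and join with literal '/'."""
    return "/".join(quote(segment, safe="", encoding=charset) for segment in segments)


raw = "/wiki/caf%C3%A9/a%2Fb"  # a plain str containing only ASCII characters
segments = decode_raw_path(raw, "utf-8")
round_tripped = encode_raw_path(segments, "utf-8")
```

A middleware that edits the path would decode with the request's charset, 
change the segments, and write the result back with encode_raw_path.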

What do you think about this? Again, sorry if this has been discarded before! 
:)

* For example, you can always convert Latin-1 to UTF-8, but not every UTF-8 
string can be converted to Latin-1.
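
A quick way to check this, as a sketch (the helper name is made up):

```python
def can_convert(data, source, target):
    """Return True if bytes in `source` charset can be re-encoded as `target`."""
    try:
        data.decode(source).encode(target)
    except UnicodeError:
        return False
    return True


latin1_to_utf8 = can_convert("caf\xe9".encode("latin-1"), "latin-1", "utf-8")
utf8_to_latin1 = can_convert("\u20ac".encode("utf-8"), "utf-8", "latin-1")  # euro sign
```
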
-- 
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech  ~  About me: =Gustavo/about |