[Web-SIG] WSGI for Python 3
me at gustavonarea.net
Fri Jul 16 21:17:36 CEST 2010
> Having two ways of expressing the same information will lead to bugs
> related to which data is canonical. If an application is using
> SCRIPT_NAME/PATH_INFO and then updates those values in any way, and
> wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be
> weird bugs and code will disagree about which one is correct. Since %2f
> can exist in the raw versions, there isn't even a way to chunk the two
> variables in the same way.
I can't agree more.
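To make the quoted point about %2f concrete, here is a small Python 3 sketch of how a raw path and its decoded counterpart split into different segments. The path value is made up for illustration.

```python
from urllib.parse import unquote

# Hypothetical raw (percent-encoded) path, as it would arrive on the wire.
raw_path = "/files/a%2Fb/edit"

# What a decoded variable such as PATH_INFO would contain.
decoded_path = unquote(raw_path)

# Splitting on "/" no longer yields the same pieces.
raw_segments = raw_path.split("/")
decoded_segments = decoded_path.split("/")

print(raw_segments)      # ['', 'files', 'a%2Fb', 'edit']
print(decoded_segments)  # ['', 'files', 'a', 'b', 'edit']
```

Once the two lists differ in length, there is no reliable way to map a change made in one representation back onto the other.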
I would propose the following, and excuse me in advance if this has already
been proposed and discarded -- I've tried to follow this topic on the mailing
list over the past few months, until it became an endless discussion.
I think only the raw values should be available. Even if a middleware changes
them, it must write them back as raw values. And because you cannot change those
values without knowing what encoding the request uses, the character encoding
*must* be present.
I know that sounds easy but it's not, because browsers don't specify the
charset in the Content-Type and instead they generate a new request using the
charset from the previous response. So the charset is unknown to the
server/gateway and the middleware stack.
So, what we could do is introduce a mandatory variable called, say,
wsgi.charset, which would be used as follows:
- It MUST be set by the server or gateway on every request.
- Every middleware or application that reads or writes these values MUST use
  the charset specified in wsgi.charset.
- If a server, gateway, middleware or application wants to change the charset
  and it is possible*, it MUST convert the *entire* request into that charset
  and update wsgi.charset accordingly.
- When the charset is not specified in the HTTP request, UTF-8 MUST be
  assumed by the server/gateway, unless another default charset has been
  specified by the user.
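As a rough Python 3 sketch of the first and last rules above, a gateway could fill in the variable along these lines. The wsgi.charset key is only the proposal from this message, and the helper names charset_from_content_type and set_request_charset, as well as the default_charset parameter, are made up for illustration.

```python
def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"').lower()
    return None


def set_request_charset(environ, default_charset="utf-8"):
    """Set the proposed (non-standard) wsgi.charset key on a request."""
    declared = charset_from_content_type(environ.get("CONTENT_TYPE", ""))
    # Fall back to UTF-8, or to whatever default the user configured.
    environ["wsgi.charset"] = declared or default_charset
    return environ
```

For example, set_request_charset({"CONTENT_TYPE": "text/plain; charset=ISO-8859-1"}) would store "iso-8859-1", while a request with no declared charset would get "utf-8".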
I think/hope that will solve all the problems.
What happens when a WSGI application is actually made up of two WSGI
applications and they send their responses in different charsets? If it's not
possible to configure them so that they both use the same charset, then one of
them would have to be wrapped by a middleware which:
- On egress, converts the responses to the charset used by the other
  application.
- On ingress, if the charset is not specified in the request, assumes it's
  the one used by the other application, and thus converts the request to
  the charset supported by the wrapped application.
It would look like this:
    def application(environ, start_response):
        # The "/trac" routing rule is an assumption, for illustration only.
        if environ.get("PATH_INFO", "").startswith("/trac"):
            # Say Trac only supports Latin-1 and we want responses to use UTF-8:
            app = trac.web.main.dispatch_request
            app = CharsetNormalizer(app, response="latin-1", request="utf8")
        else:
            # myapp uses UTF-8
            app = myapp
        return app(environ, start_response)
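CharsetNormalizer is not an existing class. The following Python 3 sketch is one possible reading of it, assuming that "response" names the charset the wrapped application emits and "request" names the charset used outside the wrapper. It only covers the egress half, assumes every response body is text, and buffers the whole body, so it is an illustration rather than a usable implementation.

```python
class CharsetNormalizer:
    """Hypothetical middleware that re-encodes text responses."""

    def __init__(self, app, response, request):
        self.app = app
        self.app_charset = response    # charset the wrapped app emits
        self.outer_charset = request   # charset used outside the wrapper

    def __call__(self, environ, start_response):
        def fixed_start_response(status, headers, exc_info=None):
            new_headers = []
            for name, value in headers:
                if name.lower() == "content-length":
                    # The length changes after re-encoding, so drop it.
                    continue
                if name.lower() == "content-type":
                    media_type = value.split(";")[0].strip()
                    value = media_type + "; charset=" + self.outer_charset
                new_headers.append((name, value))
            return start_response(status, new_headers, exc_info)

        # Ingress conversion of the raw request values is left out here.
        result = self.app(environ, fixed_start_response)
        try:
            body = b"".join(result)
        finally:
            # WSGI requires calling close() on the iterable if it exists.
            if hasattr(result, "close"):
                result.close()
        return [body.decode(self.app_charset).encode(self.outer_charset)]
```

A real version would also have to leave non-text responses alone and handle the ingress side described above.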
Then there's the string vs bytes issue. Bytes would be the natural choice to
represent these raw values, but they would probably cause more trouble than
they solve. So, I think they should be strings that contain the ASCII raw
encoded values (i.e., str on both versions of Python).
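To illustrate, here is a Python 3 sketch of what such an ASCII-only str could look like and how it might be re-encoded when the charset changes. transcode_raw is a made-up helper, not part of any WSGI proposal. Note that it percent-encodes every reserved character inside a segment, so it normalizes more aggressively than a real implementation might want.

```python
from urllib.parse import quote, unquote_to_bytes


def transcode_raw(raw_value, old_charset, new_charset):
    """Re-encode a percent-encoded ASCII str from one charset to another."""
    segments = []
    # Split on literal slashes first so that an encoded %2F stays encoded.
    for segment in raw_value.split("/"):
        text = unquote_to_bytes(segment).decode(old_charset)
        segments.append(quote(text.encode(new_charset), safe=""))
    return "/".join(segments)


# Latin-1 to UTF-8 always works.
print(transcode_raw("/caf%E9/a%2Fb", "latin-1", "utf-8"))  # /caf%C3%A9/a%2Fb

# UTF-8 to Latin-1 can fail: the euro sign has no Latin-1 encoding.
try:
    transcode_raw("/%E2%82%AC", "utf-8", "latin-1")
except UnicodeEncodeError:
    print("cannot be represented in latin-1")
```

The input and output are both plain str objects containing only ASCII, which is the representation suggested above.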
What do you think about this? Again, sorry if this has been discarded before!
* For example, you can always convert Latin-1 to UTF-8, but not every UTF-8
string can be converted to Latin-1.
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech ~ About me: =Gustavo/about |