[Web-SIG] WSGI 2

Graham Dumpleton graham.dumpleton at gmail.com
Wed Aug 12 01:25:21 CEST 2009


2009/8/12 Henry Precheur <henry at precheur.org>:
> Using bytes for all `environ` values is easy to understand on the
> application side as long as you are aware of the encoding problem. The
> cost is inconvenience, but that's probably OK. It's also simpler to
> implement on the gateway/server side.

Using bytes everywhere can be inconvenient on the gateway/server
side, at least in terms of the end result for the user.

The specific problem is that the WSGI environment is used to hold
information about the original request, as CGI variables, but can
also hold user-specified custom variables.

In the case of anything hosted via Apache, such as through mod_wsgi,
mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
custom variables using the SetEnv directive. Thus one might say:

  SetEnv trac.env_path /usr/local/trac/site-1

If the rule is that everything in the WSGI environment coming from the
WSGI adapter must be bytes, then there is potential for a mismatch in
expectations about how values will be passed. That is, a value set
using SetEnv would be bytes, but the same value set by a WSGI
middleware wrapper used for configuration is more likely to be a
string. It would seem overly onerous to expect WSGI middleware to use
bytes for configuration variables as well, and so force all consumers
to always convert to string using an appropriate encoding, where the
required encoding is potentially unknown.
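
To make the mismatch concrete, here is a minimal sketch, assuming
Python 3. The helper name `get_config` and the default encoding are
hypothetical and not part of any WSGI specification; the point is only
that every consumer would need something like it.

```python
def get_config(environ, key, encoding="utf-8"):
    """Return a configuration value as str, whatever type it arrived as.

    Hypothetical helper. The encoding is a guess, which is exactly
    the problem described above.
    """
    value = environ[key]
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Value as it might arrive from the server adapter under a bytes-only rule.
from_setenv = {"trac.env_path": b"/usr/local/trac/site-1"}

# The same setting injected by a configuration middleware written in Python.
from_middleware = {"trac.env_path": "/usr/local/trac/site-1"}
```

Both dictionaries describe the same setting, yet a consumer that reads
`environ["trac.env_path"]` directly gets a different type depending on
where the value came from.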

The underlying problem here is in part, albeit maybe by convention,
that there is a single dictionary for both request information and
user configuration. It isn't a simple matter of splitting them,
though, so that request information is always separate. This is
because for FASTCGI, SCGI and CGI you can't split them, as there is
only one grouping of variables in those cases.

This is why I specifically asked previously, and no one has answered:
if bytes are to be used, which variables in the WSGI environment
should be passed as bytes? If there is a specified list of variables
which will always be bytes, that may be more manageable. If someone is
going to suggest that only CGI variables should be bytes, then what
does that actually mean? Remember that for FASTCGI, SCGI and CGI there
isn't really a distinction, so the boundary as to what is a CGI
variable is fuzzy, although you could reverse the transformation and
get back bytes if you know which variables to do it for.
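
As an illustration of reversing the transformation, here is a sketch
that assumes the gateway decoded the raw bytes as ISO-8859-1. That
choice of encoding is an assumption made for the example, and the
function names are hypothetical.

```python
def to_native(raw: bytes) -> str:
    # ISO-8859-1 maps each byte 0-255 to the code point with the same
    # value, so this decode cannot fail and loses no information.
    return raw.decode("iso-8859-1")


def recover_bytes(text: str) -> bytes:
    # Exact inverse of to_native(), valid only for values known to have
    # been produced by it.
    return text.encode("iso-8859-1")


raw_path = b"/caf\xc3\xa9"           # UTF-8 bytes as sent by the client
native = to_native(raw_path)         # what the application would receive
real_path = recover_bytes(native).decode("utf-8")
```

This only works for variables the application knows were decoded that
way, which is why the list of such variables matters.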

One could restrict use of bytes to just SCRIPT_NAME, PATH_INFO and
QUERY_STRING, and maybe that will suffice. It may not though, because
what about headers such as HTTP_REFERER? Also, what about additional
SSL_* variables that an SSL module for the web server may add?
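
A rough sketch of what such a restriction could look like in an
adapter, assuming Python 3. The names `BYTES_KEYS` and `build_environ`
are hypothetical, and decoding everything else as ISO-8859-1 is an
assumption made for the example, not an agreed rule.

```python
# Hypothetical fixed list of variables that always stay as bytes.
BYTES_KEYS = frozenset({"SCRIPT_NAME", "PATH_INFO", "QUERY_STRING"})


def build_environ(raw_environ, encoding="iso-8859-1"):
    """Keep listed variables as bytes and decode every other bytes value."""
    environ = {}
    for key, value in raw_environ.items():
        if key in BYTES_KEYS or not isinstance(value, bytes):
            environ[key] = value
        else:
            environ[key] = value.decode(encoding)
    return environ


raw = {
    "PATH_INFO": b"/wiki/caf\xc3\xa9",
    "QUERY_STRING": b"action=edit",
    "HTTP_REFERER": b"http://example.com/",
    "trac.env_path": b"/usr/local/trac/site-1",
}
environ = build_environ(raw)
```

Under this sketch HTTP_REFERER and any SSL variables end up as strings,
which is exactly the case where a short fixed list may not be enough.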

Graham

> By choosing bytes, WSGI passes the encoding problem to the application,
> which is good. Let the application deal with that. It's more likely to
> know what it needs, and what problem it can ignore. I think that 99% of
> the time, applications will just decode bytes to string using UTF-8,
> ignoring invalid values.
>
> However it's likely that we'll see middlewares converting ALL
> environment values to UTF-8, because it's more convenient than using
> bytes. And some middlewares might depend on `environ` values being
> string instead of bytes, because it's convenient too.
>
>
> This issue was already raised by Graham. And I think it's important to
> make it clear. I believe that 'server/CGI' values in the environment
> shouldn't be modified--Of course it should still be possible to add new
> values. This way the stack will always remain in a 'sane' state.
>
> For example if a middleware wants to convert environ values to UTF-8, it
> shouldn't do that:
>
>>   for key, value in environ.items():
>>       environ[key] = str(value, encoding='utf8')
>
> But something like this--assuming there's only bytes in `environ`:
>
>>   environ['unicode.environ'] = dict((key, str(value, encoding='utf8'))
>>                                     for key, value in environ.items())
>
> I'm in favor of using bytes everywhere. But it's important to document
> why bytes are used and how to use them. I'm not sure this should be
> included in a PEP, maybe a "WSGI best practices"?
>
>
> Cheers,
>
> --
>  Henry Prêcheur
>

