[Web-SIG] WSGI 2

Wed Aug 12 03:23:44 CEST 2009

2009/8/12 Henry Precheur <henry at precheur.org>:
> On Wed, Aug 12, 2009 at 09:25:21AM +1000, Graham Dumpleton wrote:
>> Use of bytes everywhere can be inconvenient on the gateway/server
>> side, at least as far as end result for user.
>
> Yes, but wouldn't it be simpler for mod_wsgi to only deal with bytes?
> unicode C strings -> bytes and char* -> bytes conversions seem
> straightforward.

At the C code level it doesn't really make any difference, as it is
pretty much the same number of C API calls either way. All the code for
this is also already written in mod_wsgi and is configurable to be done
any which way, so people could play with different alternatives. When a
decision is actually made, I just need to make that decision the
default. The only extra complexity comes where a subset of the WSGI
environment should be bytes. To make that at least somewhat easier, we
need a simple, well-defined rule. One where, if the first character of
the variable name is an uppercase letter, bytes are used might be
reasonable. Anything more complicated may be a pain.
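
The first-character rule described above could be sketched roughly like
this in Python. This is only an illustration and not mod_wsgi code: the
function name, the assumption that the adapter starts with every value
as bytes, and the UTF-8 default for the remaining keys are all
hypothetical.

```python
def apply_uppercase_rule(raw_environ, encoding="utf-8"):
    """Build an environ dict from raw byte values.

    Keys whose first character is an uppercase ASCII letter keep their
    bytes values; all other keys are decoded to str. The encoding used
    for decoding is an assumption made for this sketch.
    """
    environ = {}
    for name, value in raw_environ.items():
        if "A" <= name[:1] <= "Z":
            # e.g. PATH_INFO, HTTP_COOKIE, SSL_* stay as bytes
            environ[name] = value
        else:
            # e.g. trac.env_path becomes a native string
            environ[name] = value.decode(encoding)
    return environ
```

With this rule a SetEnv variable such as trac.env_path arrives as a
string, while PATH_INFO stays as bytes, and neither side needs a list
of special-cased names.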

> But char* -> string doesn't look easy to do, since you have to 'guess'
> the encoding.

Only for stuff that derives from the HTTP request, which is the
argument for using bytes and leaving it up to the application to
decide. For user custom variables, it would be UTF-8, as that is what
Apache effectively treats the configuration file as being.
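
As a rough sketch of what "leave it up to the application" might look
like in Python: the helper names and the UTF-8 then Latin-1 fallback
order are assumptions chosen for illustration, not something any spec
or this thread prescribes.

```python
def decode_request_value(value, encodings=("utf-8", "latin-1")):
    """Decode a request-derived bytes value, trying each encoding in turn.

    The candidate list is the application's own guess; Latin-1 accepts
    any byte sequence, so with the default list this always succeeds.
    """
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the value")


def decode_custom_variable(value):
    """Decode a SetEnv-style custom variable, assumed here to be UTF-8."""
    return value.decode("utf-8")
```

The point is only that the guess is made by the application, which
knows what its URLs and forms look like, rather than by the server.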

> These are suppositions; I have never worked on a WSGI server/gateway.

Which is the same for most people and perhaps why many don't want to
wade into this argument. That is, the attitude is that it is a problem
for those who want to write hosting adapters and not an issue for
application developers. The reality is that it needs to be guided by
application developers, as they are the ones who have to work with
whatever interface is defined.

Graham

> Correct me if I'm wrong.
>
>> The specific problem is that WSGI environment is used to hold
>> information about the original request, as CGI variables, but also can
>> hold user specified custom variables.
>>
>> In the case of anything hosted via Apache, such as through mod_wsgi,
>> mod_fastcgi, mod_fcgid, mod_scgi and mod_cgi(d), users can set such
>> custom variables using the SetEnv directive. Thus one might say:
>>
>>   SetEnv trac.env_path /usr/local/trac/site-1
>>
>> If the rule is that everything in WSGI environment coming from WSGI
>> adapter must be bytes then you have a potential for mismatch in
>> expectations of how values will be passed. That is, if set using
>> SetEnv then would be bytes, but if set using WSGI middleware wrapper
>> for configuration, more likely going to be string. It would seem
>> overly onerous to expect WSGI middleware to use bytes for
>> configuration variables as well and so force all consumers to always
>> be converting to string using the appropriate encoding, where the
>> required encoding is potentially unknown.
>
> Is it reasonable to expect a configuration variable to have a certain
> type? I am tempted to say 'no', but that's because I like the "everything
> is bytes" approach so much :) I don't have any experience with
> configuration variables passed via the WSGI environment though.
>
> But it could be quite a problem, for example 'Developer authentication'
> posted a month ago by Ian Bicking requires its configuration variable to
> be a string, but I don't think this spec applies to WSGI on Py3K or WSGI
> 2.
>
>> This is why I specifically asked previously, and which no one has
>> answered, if bytes is to be used, which variables in WSGI environment
>> should be passed as bytes. If there is a known specified list of
>> variables which it is known will always be bytes, may be more
>> manageable. If someone is going to suggest that only CGI variables
>> should be bytes, then what does that actually mean. Remember that for
>> FASTCGI, SCGI, CGI there isn't really a distinction and so where the
>> boundary is as to what is a CGI variable is fuzzy although you could
>> reverse the transformation and get back bytes if you know what to do it for.
>>
>> One could restrict use of bytes to just SCRIPT_NAME, PATH_INFO and
>> QUERY_STRING and maybe that will suffice. It may not though, because
>> what about headers such as HTTP_REFERER? Also, what about additional
>> SSL_? variables that an SSL module for the web server may add?
>
> What you are proposing is 'black-listing' some variables known to cause
> problems.
>
> It will be difficult to come up with an exhaustive list of variables
> with different encoding. Even if we were able to come up with such a
> list, it creates 2 different cases and could end up complicating
> application developers' lives. That's why the approach "everything coming
> from the server/gateway is bytes" makes sense, it is simpler to explain,
> it is simpler to understand, and it's, I think, more pythonic (There
> should be one-- and preferably only one --obvious way to do it.)
>
> Just consider the case of cookies. I don't know if you can use non-ASCII
> characters in them, but it is possible that they will mess up "everything
> is string except a, b, c" if we forget to include them in the list.
> "Everything is bytes" is in this sense more future-proof than
> "black-listing a, b, c". If a variable with a weird encoding appears a
> few months after the new PEP is released, "everything is bytes" still
> works, but the "black-list" approach stops working.
>
>
> Cheers,
>
> --
>  Henry Prêcheur
>