[Web-SIG] WSGI 2

Tue Aug 4 06:28:58 CEST 2009

2009/8/4 P.J. Eby <pje at telecommunity.com>:
>> 5. When running under Python 3, servers MUST provide CGI HTTP and
>> server variables as strings. Where such values are sourced from a byte
>> string, be that a Python byte string or C string, they should be
>> converted as 'UTF-8'. If a specific web server infrastructure is able
>> to support different encodings, then the WSGI adapter MAY provide a
>> way for a user of the WSGI adapter to customise on a global basis, or
>> on a per value basis what encoding is used, but this is entirely
>> optional. Note that there is no requirement to deal with RFC 2047.
>>
>> This is where I am going to diverge from what has been discussed before.
>>
>> The reason I am going to pass as UTF-8 and not latin-1 is that it
>> looks like Apache effectively only supports use of UTF-8. Since this
>> means that mod_wsgi, and Apache modules for FASTCGI, SCGI, AJP and
>> even CGI likely cannot handle anything besides UTF-8 then I really
>> can't see the point of trying to cater for a theoretical possibility
>> that some HTTP client could use something besides UTF-8. In other
>> words, the predominant case will be UTF-8, so let us target that.
>>
>> So, rather than burden every WSGI application with the need to convert
>> from latin-1 back to bytes and then to UTF-8, let the server deal with
>> it, with the server using a sensible default. Where the server
>> infrastructure can handle a different encoding, it can provide an
>> option to use that encoding, and the WSGI application doesn't need to
>> change.
>
> Maybe I'm missing something here, but what if Apache receives something
> encoded in Latin-1?  AFAIR, form POST encoding is determined by the encoding
> of the page containing the form; that's of course something that only
> happens in the input body, but what about URLs?
>
> Mainly I'm wondering, what should the server do in the event they receive a
> byte string which is not valid UTF-8?  (Latin-1 doesn't have this problem,
> since there's no such thing as an invalid Latin-1 string, at least not at
> the encoding level.)
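
To make the difference concrete, here is a minimal Python 3 sketch of
the two decoding behaviours being compared, plus the latin-1 round trip
an application would otherwise have to do. The helper name try_decode
and the sample values are illustrative assumptions, not part of any
WSGI proposal.

```python
def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if raw is invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The same accented text as it would arrive in two different encodings.
accented_latin1 = "b\u00e1z".encode("latin-1")  # b'b\xe1z'
accented_utf8 = "b\u00e1z".encode("utf-8")      # b'b\xc3\xa1z'

# Latin-1 maps every byte value to a code point, so decoding never fails.
every_byte_as_latin1 = try_decode(bytes(range(256)), "latin-1")

# The Latin-1 bytes are not a valid UTF-8 sequence; the UTF-8 bytes are.
latin1_read_as_utf8 = try_decode(accented_latin1, "utf-8")
utf8_read_as_utf8 = try_decode(accented_utf8, "utf-8")

# The round trip an application needs if the server decoded as latin-1
# but the client actually sent UTF-8.
server_supplied = accented_utf8.decode("latin-1")
recovered = server_supplied.encode("latin-1").decode("utf-8")
```

A server that decodes as UTF-8 therefore needs some policy for the
failing case, whereas a latin-1 server pushes the re-decoding step onto
every application.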

Can you clarify? We aren't talking about request content here. The
wsgi.input stream is still binary, and it is up to the WSGI application
to decide how it should be decoded.

The only related thing I can think of that you might be talking about
is the form target URL, which is an issue for GET and POST requests, or
other method types, from a form.

>> Also shown though that SCRIPT_NAME part has to be UTF-8
>> and we would really be entering fantasy land if you were somehow going
>> to cope with some different encoding for PATH_INFO and QUERY_STRING.
>> Instead it is like the GPL, viral in nature. Use of UTF-8 in one
>> particular area means you are effectively bound to use UTF-8
>> everywhere else.
>
> I'm not clear on your logic here.  If I request foo/bar/baz (where baz
> actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the
> script, then the (accented) baz is legitimate for pass-through to the
> application, no?

Technically yes, but what I am pointing out is that Apache pretty much
says that foo/bar needs to be UTF-8. If you are going to have
different parts of the one URL needing different encodings to be
understood, personally I would say you are asking for trouble. So I am
saying that UTF-8 really needs to apply throughout, for the sake of
sanity and portability.
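
As a rough Python 3 illustration of the mixed-encoding problem, assume
a hypothetical script mounted at a UTF-8 encoded path and a client that
sends the trailing path segment as latin-1. The paths and variable
names below are made up for the example.

```python
# Hypothetical split of one request path into its two WSGI parts.
script_name = "/caf\u00e9".encode("utf-8")  # b'/caf\xc3\xa9', server side
path_info = "/b\u00e1z".encode("latin-1")   # b'/b\xe1z', client side
request_path = script_name + path_info

# Decoding the whole path with a single encoding goes wrong either way.
try:
    whole_as_utf8 = request_path.decode("utf-8")
except UnicodeDecodeError:
    whole_as_utf8 = None  # the latin-1 segment is not valid UTF-8

whole_as_latin1 = request_path.decode("latin-1")  # UTF-8 part is mangled

# Only per-part knowledge of the encodings recovers the intended text.
per_part = script_name.decode("utf-8") + path_info.decode("latin-1")
```

Nothing in the request tells the server where one encoding stops and
the other starts, which is why a single encoding for the whole URL is
the only thing that is practical to support.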

> I just tried testing this with Firefox and Apache, and found that you can in
> fact pass such Latin-1 strings through to PATH_INFO, but at least in the
> case of Firefox, you have to %-escape them.  However, they are seen by
> Python (via os.environ) as latin-1 encoded byte strings.

By using % escapes, aren't you in practice overriding the encoding that
the browser would apply to the URL if given the raw character? What
happens if you paste the accented character directly into the
browser URL bar? The browsers I have played with would normally
automatically translate that to UTF-8 and send it as such, with %
encoding as necessary.
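
A small Python 3 sketch of the two cases, using urllib.parse from the
standard library. It only mimics what a client might put on the wire;
it is not a claim about what any particular browser does.

```python
from urllib.parse import quote, unquote_to_bytes

accented = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE

# What a client sends if it encodes the raw character as UTF-8 first.
utf8_escaped = quote(accented)  # quote() defaults to UTF-8

# What a hand-written latin-1 % escape looks like.
latin1_escaped = quote(accented, encoding="latin-1")

# Undoing the % escapes gives the server different bytes in each case.
utf8_wire_bytes = unquote_to_bytes(utf8_escaped)
latin1_wire_bytes = unquote_to_bytes(latin1_escaped)
```

The escape sequence itself carries no encoding label, so the server
sees only bytes and has to assume which encoding produced them.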

So I guess the problem is more with URLs that are already % encoded
when they come back in an href or form action, because when clicked on
they may be in an encoding incompatible with UTF-8.

>> A further example of why UTF-8 reaches into everything is the
>> mod_rewrite module for Apache. This allows you to do stuff related to
>> the SCRIPT_NAME, PATH_INFO and QUERY_STRING parts of a URL. Already
>> shown that the Apache configuration file has to be UTF-8. If the URL
>> isn't, then it wouldn't be possible to perform matches against non
>> latin-1 characters in a rewrite condition or rule. This is because
>> your match string would be in a different encoded form to that in the
>> URL and so wouldn't match.
>
> Note that this still doesn't have any impact on the bytes that actually
> reach the application, which can be non-UTF8.  At minimum, the proposal is
> underspecified as to how to handle this case, which is as trivial to
> generate as sticking a %-escape in the PATH_INFO or QUERY_STRING portion(s)
> of a URL.

The Apache server at least will decode those % escape sequences, and I
believe it is the result of that decoding which is used in things like
rewrite rule matches, not the raw URL. The only exception would be if a
rewrite rule explicitly matched against the REQUEST_URI variable, which
still contains the % escape sequences. So if the bytes are not UTF-8,
you effectively can't match them with Apache rewrite rules.
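
The matching problem can be simulated in Python 3. This is only a
sketch of the byte-level mismatch, assuming the rule text is stored as
UTF-8 and matching is done on the % decoded bytes; it uses Python's re
module and is not mod_rewrite itself. The names rule, raw_rule and
rule_matches are invented for the example.

```python
import re
from urllib.parse import unquote_to_bytes

# A rule written in a UTF-8 encoded configuration file.
rule = re.compile(b"^" + re.escape("/caf\u00e9".encode("utf-8")) + b"$")

# A rule that targets the raw, still % escaped, request URI instead.
raw_rule = re.compile(rb"^/caf%E9$")


def rule_matches(raw_url_path: str) -> bool:
    """Decode % escapes to bytes, then match against the UTF-8 rule."""
    decoded = unquote_to_bytes(raw_url_path)
    return rule.match(decoded) is not None


utf8_request_matches = rule_matches("/caf%C3%A9")
latin1_request_matches = rule_matches("/caf%E9")
raw_uri_matches = raw_rule.match(b"/caf%E9") is not None
```

The latin-1 request can only be matched by spelling out the escape
sequence against the raw URI, never by writing the accented character
in the rule.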

Graham