[Web-SIG] WSGI 2

Sun Aug 16 12:13:50 CEST 2009

2009/8/12 Ian Bicking <ianb at colorstudy.com>:
> On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer <fumanchu at aminus.org> wrote:
>>
>> > 5. When running under Python 3, servers MUST provide CGI HTTP and
>> > server variables as strings. Where such values are sourced from a byte
>> > string, be that a Python byte string or C string, they should be
>> > converted as 'UTF-8'. If a specific web server infrastructure is able
>> > to support different encodings, then the WSGI adapter MAY provide a
>> > way for a user of the WSGI adapter to customise on a global basis, or
>> > on a per value basis what encoding is used, but this is entirely
>> > optional. Note that there is no requirement to deal with RFC 2047.
>>
>> We're passing unicode for almost everything.
>>
>> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
>> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
>> ACTUAL_SERVER_PROTOCOL entries.
>>
>> The original bytes of the Request-URI are stored in REQUEST_URI. However,
>> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
>> configurable charset, defaulting to UTF-8. If the path cannot be decoded
>> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
>> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
>> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
>> it, we would make it decoded by the same charset.
>
> My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> encoding a page might be in.  At least that's what I got when testing
> Firefox.  It might not be valid UTF-8 if it was manually constructed, but
> then there's little reason to think it is valid anything; only the bytes or
> REQUEST_URI are likely to be an accurate representation.  (Frankly I wish
> PATH_INFO was not url-decoded, which would remove this issue entirely --
> REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
> know of reasonable cases where this wouldn't be true.)
> I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be
> used to kind of reconstruct the original request path (the surrogateescape
> or whatever it is called would serve the same purpose, but is only available
> in Python 3).
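
A minimal sketch of the round trip mentioned above, assuming the raw
path bytes are available. The helper names are hypothetical and only
for illustration. Both latin-1 and the surrogateescape error handler
(Python 3.1 and later) let the original bytes be recovered from the
decoded string:

```python
def roundtrip_latin1(raw):
    # latin-1 maps every byte 0-255 to the code point of the same value,
    # so decoding never fails and encoding restores the original bytes.
    text = raw.decode("latin-1")
    return text.encode("latin-1")


def roundtrip_surrogateescape(raw):
    # Undecodable bytes become lone surrogates (U+DC80..U+DCFF) and are
    # turned back into the same bytes when encoding with the same handler.
    text = raw.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogateescape")
```

Neither variant says what the path actually means; they only preserve
the bytes so that something else can transcode later.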

Having thought about it for a while, I get the feeling that falling
back to latin-1 if a string cannot be decoded as UTF-8 is a bad idea.
That URLs wouldn't consistently use the same encoding all the time
just seems wrong. I would see a decoding failure as grounds for
returning a bad request status. If an application coder knows they are
actually going to be dealing with latin-1, as that is how the
application is written, then they should be specifying that it is
latin-1 always instead of UTF-8. Thus, the WSGI adapter should provide
a means to override what encoding is used.
For simple WSGI adapters which only service one WSGI application, it
would apply to the whole URL namespace. For something like Apache,
where requests could map to multiple WSGI applications, the adapter
may want to provide a means of overriding the encoding for specific
subsets of URLs, e.g., by using the Location directive.
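
A rough sketch of that idea, not taken from any existing adapter: the
names BadRequest, decode_path and encoding_for are hypothetical, and
the overrides mapping stands in for whatever per-URL configuration an
adapter might expose. The path is decoded strictly with one configured
encoding, and a failure is treated as a bad request rather than
triggering a second attempt with latin-1:

```python
from urllib.parse import unquote_to_bytes


class BadRequest(Exception):
    """Hypothetical error an adapter would turn into a 400 response."""
    status = "400 Bad Request"


def encoding_for(raw_path, overrides, default="utf-8"):
    """Pick the encoding by longest matching URL prefix (plain prefix match).

    overrides maps bytes prefixes to encoding names, e.g.
    {b"/legacy": "latin-1"}; anything unmatched uses the default.
    """
    best = b""
    for prefix in overrides:
        if raw_path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return overrides.get(best, default)


def decode_path(raw_path, encoding="utf-8"):
    """Percent-decode the raw path bytes, then decode strictly.

    There is deliberately no fallback encoding here.
    """
    try:
        return unquote_to_bytes(raw_path).decode(encoding, "strict")
    except UnicodeDecodeError:
        raise BadRequest("path is not valid %s" % encoding)
```

With no overrides every URL is decoded as UTF-8; an application known
to use latin-1 would be configured that way for its whole namespace or
for a subset of URLs.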

Graham