[Web-SIG] Python 3.0 and WSGI 1.0.

Thu Apr 16 09:12:11 CEST 2009

2009/4/4 Robert Brewer <fumanchu at aminus.org>:
> Alan Kennedy wrote:
>> [Bill]
>> > I think the controlling reference here is RFC 3875.
>>
>> I think the controlling references are RFC 2616, RFC 2396 and RFC
>> 3987.
>>
>> RFC 2616, the HTTP 1.1 spec, punts on the question of character
>> encoding for the request URI.
>>
>> RFC 2396, the URI spec, says
>>
>> """
>>    It is expected that a systematic treatment of character encoding
>>    within URI will be developed as a future modification of this
>>    specification.
>> """
>>
>> RFC 3987 is that spec, for Internationalized Resource Identifiers. It
>> says
>>
>> """
>> An IRI is a sequence of characters from the Universal Character Set
>> (Unicode/ISO 10646).
>> """
>>
>> and
>>
>> """
>> 1.2.  Applicability
>>
>>    IRIs are designed to be compatible with recommendations for new URI
>>    schemes [RFC2718].  The compatibility is provided by specifying a
>>    well-defined and deterministic mapping from the IRI character
>>    sequence to the functionally equivalent URI character sequence.
>>    Practical use of IRIs (or IRI references) in place of URIs (or URI
>>    references) depends on the following conditions being met:
>> """
>>
>> followed by
>>
>> """
>>    c.  The URI corresponding to the IRI in question has to encode
>>        original characters into octets using UTF-8.  For new URI
>>        schemes, this is recommended in [RFC2718].  It can apply to a
>>        whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384],
>>        or the URN syntax [RFC2141]).  It can apply to a specific part of
>>        a URI, such as the fragment identifier (e.g., [XPointer]).  It
>>        can apply to a specific URI or part(s) thereof.  For details,
>>        please see section 6.4.
>> """
>>
>> I think the question is "are people using IRIs in the wild"? If so,
>> then we must decide how do we best deal with the problems of
>> recognising iso-8859-1+rfc2047 versus utf-8, or whatever
>> server-configured encoding the user has chosen.
>
> Agreed. The Request-URI needs to handle IRI's. The headers mostly
> don't--almost all headers are mostly of type "token", which is US-ASCII.
> A few are of type "TEXT", which is ISO-8859-1/RFC 2047. The remaining
> (sub)values are mostly custom byte sequences:
>
> field-name           field-value
> ----------           -----------
> Accept               token
> Accept-Charset       token
> Accept-Encoding      token
> Accept-Language      ALPHA, plus ":", "=", "q" etc
> Accept-Ranges        token
> Age                  DIGIT
> Allow                token
> Authorization        token
> Cache-Control        token
> Connection           token
> Content-Encoding     token
> Content-Language     ALPHA
> Content-Length       DIGIT
> Content-Location     absoluteURI | relativeURI
> Content-MD5          base64 of 128 bit md5 digest
> Content-Range        DIGIT, plus "/" etc
> Content-Type         token
> Date                 HTTP-date
> ETag                 TEXT and CHAR
> Expect               token, quoted-string
> Expires              HTTP-date
> From                 ASCII (see RFC 822)
> Host                 host ":" port
> If-Match             TEXT and CHAR
> If-Modified-Since    HTTP-date
> If-None-Match        TEXT and CHAR
> If-Range             TEXT and CHAR | HTTP-date
> If-Unmodified-Since  HTTP-date
> Last-Modified        HTTP-date
> Location             absoluteURI
> Max-Forwards         DIGIT
> Pragma               token, quoted-string
> Proxy-Authenticate   token
> Proxy-Authorization  token
> Range                token
> Referer              absoluteURI | relativeURI
> Retry-After          HTTP-date | DIGIT
> Server               token, TEXT
> TE                   token
> Trailer              token
> Transfer-Encoding    token
> Upgrade              token
> User-Agent           token, TEXT
> Vary                 token
> Via                  token, host, port
> Warning              quoted-string, HTTP-date, host, port
> WWW-Authenticate     token
>
>
> The Content-Location, Location, and Referer headers are problematic
> since HTTP borrows those from the URI spec, which deals in characters
> and not bytes, as you mentioned. Host, and maybe Via, are also special
> due to possible IDNA-encoding.
>
> Regarding extension headers, I think we should assume that the HTTP/1.1
> spec implies all headers should be token (ASCII) or TEXT (ISO-8859-1).
> From section 4.2:
>
>    field-content  = <the OCTETs making up the field-value
>                     and consisting of either *TEXT or combinations
>                     of token, separators, and quoted-string>
>
> In addition, the httpbis effort seems to be enforcing this even more
> strongly [1]:
>
>     message-header = field-name ":" OWS [ field-value ] OWS
>     field-name     = token
>     field-value    = *( field-content / OWS )
>     field-content  = *( WSP / VCHAR / obs-text )
>
>   Historically, HTTP has allowed field-content with text in the ISO-
>   8859-1 [ISO-8859-1] character encoding (allowing other character sets
>   through use of [RFC2047] encoding).  In practice, most HTTP header
>   field-values use only a subset of the US-ASCII charset [USASCII].
>   Newly defined header fields SHOULD constrain their field-values to
>   US-ASCII characters.  Recipients SHOULD treat other (obs-text) octets
>   in field-content as opaque data.
>
> So, from where I sit, we have:
>
>  1. Many header values which are ASCII.
>  2. A few header values which are ISO-8859-1 plus RFC 2047.
>  3. A few header values which are URI's (no specified encoding) or IRI's
>     (UTF-8).
>
> I understand the desire to decode ASAP, and I agree with Guido that we
> should use a default encoding which the app can override. Looking at the
> above, ISO-8859-1 is the best encoding I know of for all three header
> cases. ASCII can be used as a valid subset without transcoding; headers
> which are ISO-8859-1 are decoded perfectly; URI/IRI headers can be
> transcoded by the app if needed, but mangled opaquely by middleware.
>
> If we make *that* call, then IMO there's no reason not to do the same to
> SCRIPT_NAME, PATH_INFO, and QUERY_STRING.

I am not sure we ended up with a final answer on all of this, but I
don't want to hold up mod_wsgi 3.0, which includes Python 3.0 support,
any longer. As such, I am implementing things as per:

  http://www.wsgi.org/wsgi/Amendments_1.0

with the exception that I will not be attempting to do decoding per RFC
2047. Any CGI variables not related to HTTP headers will also be
handled as latin-1, including SCRIPT_NAME, PATH_INFO and QUERY_STRING.
This should be equivalent to what wsgiref does in Python 3.X and
basically keeps the status quo.
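
As a rough sketch of what this means for an application (not taken from
mod_wsgi or wsgiref; the helper name `transcode` and the sample path are
made up for illustration): because latin-1 maps every byte to exactly one
code point, an app that knows the client sent UTF-8 can recover the
original bytes and decode them again itself.

```python
def transcode(value, encoding="utf-8", errors="strict"):
    """Re-decode a latin-1-decoded WSGI string using another encoding.

    Hypothetical helper: encoding back to latin-1 recovers the raw bytes
    the server received, which can then be decoded as the app sees fit.
    """
    return value.encode("latin-1").decode(encoding, errors)


# Simulate a server handing over a UTF-8 path that it decoded as latin-1.
raw_path = "/caf\u00e9".encode("utf-8")            # bytes on the wire
environ = {"PATH_INFO": raw_path.decode("latin-1")}

# The app opts in to UTF-8 for this value.
path = transcode(environ["PATH_INFO"])
```

Values that really are ASCII or ISO-8859-1 need no such step, which is the
appeal of latin-1 as the default.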

If anyone has any last things to say on all of this, please speak up now.

Graham