[Web-SIG] Python 3.0 and WSGI 1.0.
Robert Brewer
fumanchu at aminus.org
Fri Apr 3 20:35:20 CEST 2009
Alan Kennedy wrote:
> [Bill]
> > I think the controlling reference here is RFC 3875.
>
> I think the controlling references are RFC 2616, RFC 2396 and RFC 3987.
>
> RFC 2616, the HTTP 1.1 spec, punts on the question of character
> encoding for the request URI.
>
> RFC 2396, the URI spec, says
>
> """
> It is expected that a systematic treatment of character encoding
> within URI will be developed as a future modification of this
> specification.
> """
>
> RFC 3987 is that spec, for Internationalized Resource Identifiers. It
> says
>
> """
> An IRI is a sequence of characters from the Universal Character Set
> (Unicode/ISO 10646).
> """
>
> and
>
> """
> 1.2. Applicability
>
> IRIs are designed to be compatible with recommendations for new URI
> schemes [RFC2718]. The compatibility is provided by specifying a
> well-defined and deterministic mapping from the IRI character
> sequence to the functionally equivalent URI character sequence.
> Practical use of IRIs (or IRI references) in place of URIs (or URI
> references) depends on the following conditions being met:
> """
>
> followed by
>
> """
> c. The URI corresponding to the IRI in question has to encode
> original characters into octets using UTF-8. For new URI
> schemes, this is recommended in [RFC2718]. It can apply to a
> whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384],
> or the URN syntax [RFC2141]). It can apply to a specific part of
> a URI, such as the fragment identifier (e.g., [XPointer]). It
> can apply to a specific URI or part(s) thereof. For details,
> please see section 6.4.
> """
>
> I think the question is "are people using IRIs in the wild"? If so,
> then we must decide how do we best deal with the problems of
> recognising iso-8859-1+rfc2047 versus utf-8, or whatever
> server-configured encoding the user has chosen.
Agreed. The Request-URI needs to handle IRIs. The headers mostly
don't--almost all headers are mostly of type "token", which is US-ASCII.
A few are of type "TEXT", which is ISO-8859-1/RFC 2047. The remaining
(sub)values are mostly custom byte sequences (a rough classification
sketch follows the table below):
field-name            field-value
----------            -----------
Accept                token
Accept-Charset        token
Accept-Encoding       token
Accept-Language       ALPHA, plus ":", "=", "q" etc.
Accept-Ranges         token
Age                   DIGIT
Allow                 token
Authorization         token
Cache-Control         token
Connection            token
Content-Encoding      token
Content-Language      ALPHA
Content-Length        DIGIT
Content-Location      absoluteURI | relativeURI
Content-MD5           base64 of 128 bit md5 digest
Content-Range         DIGIT, plus "/" etc.
Content-Type          token
Date                  HTTP-date
ETag                  TEXT and CHAR
Expect                token, quoted-string
Expires               HTTP-date
From                  ASCII (see RFC 822)
Host                  host ":" port
If-Match              TEXT and CHAR
If-Modified-Since     HTTP-date
If-None-Match         TEXT and CHAR
If-Range              TEXT and CHAR | HTTP-date
If-Unmodified-Since   HTTP-date
Last-Modified         HTTP-date
Location              absoluteURI
Max-Forwards          DIGIT
Pragma                token, quoted-string
Proxy-Authenticate    token
Proxy-Authorization   token
Range                 token
Referer               absoluteURI | relativeURI
Retry-After           HTTP-date | DIGIT
Server                token, TEXT
TE                    token
Trailer               token
Transfer-Encoding     token
Upgrade               token
User-Agent            token, TEXT
Vary                  token
Via                   token, host, port
Warning               quoted-string, HTTP-date, host, port
WWW-Authenticate      token
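
For illustration, here is a rough sketch of mine (hypothetical helper
names, nothing standardized) of how one might classify a raw header
value along the lines of the table above: "token" values are a narrow
US-ASCII subset, while "TEXT" values may contain any non-control
ISO-8859-1 octet.

import re

# token, per RFC 2616 2.2: any CHAR except CTLs and the separators.
TOKEN = re.compile(rb"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")
# TEXT, per RFC 2616 2.2: any OCTET except CTLs (0-31, 127); HTAB allowed.
TEXT = re.compile(rb"^[\t\x20-\x7e\x80-\xff]*$")

def classify_value(raw):
    """Rough classification of a raw header value (hypothetical helper)."""
    if TOKEN.match(raw):
        return "token (US-ASCII)"
    if TEXT.match(raw):
        return "TEXT (safe to decode as ISO-8859-1)"
    return "opaque bytes"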
The Content-Location, Location, and Referer headers are problematic
since HTTP borrows those from the URI spec, which deals in characters
and not bytes, as you mentioned. Host, and maybe Via, are also special
due to possible IDNA-encoding.
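
As a quick illustration of that last point (my example, not part of the
thread), Python's built-in "idna" codec shows why a Host value never
needs a non-ASCII charset on the wire:

# An internationalized host name travels IDNA-encoded, i.e. pure ASCII.
host = "bücher.example"
wire_form = host.encode("idna")        # b'xn--bcher-kva.example'
assert wire_form.decode("idna") == host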
Regarding extension headers, I think we should assume that the HTTP/1.1
spec implies all headers should be token (ASCII) or TEXT (ISO-8859-1).
From section 4.2:
   field-content  = <the OCTETs making up the field-value
                    and consisting of either *TEXT or combinations
                    of token, separators, and quoted-string>
In addition, the httpbis effort seems to be enforcing this even more
strongly [1]:
   message-header = field-name ":" OWS [ field-value ] OWS
   field-name     = token
   field-value    = *( field-content / OWS )
   field-content  = *( WSP / VCHAR / obs-text )
Historically, HTTP has allowed field-content with text in the
ISO-8859-1 [ISO-8859-1] character encoding (allowing other character sets
through use of [RFC2047] encoding). In practice, most HTTP header
field-values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD constrain their field-values to
US-ASCII characters. Recipients SHOULD treat other (obs-text) octets
in field-content as opaque data.
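
Translating that ABNF into a byte-level check is straightforward (a
sketch of mine, with WSP = SP/HTAB, VCHAR = %x21-7E, and obs-text =
%x80-FF):

import re

# field-content = *( WSP / VCHAR / obs-text )
FIELD_CONTENT = re.compile(rb"^[\t \x21-\x7e\x80-\xff]*$")

def is_field_content(raw):
    return FIELD_CONTENT.match(raw) is not None

assert is_field_content(b"text/html; charset=utf-8")
assert not is_field_content(b"bad\x00value")  # NUL is not WSP, VCHAR, or obs-text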
So, from where I sit, we have:
1. Many header values which are ASCII.
2. A few header values which are ISO-8859-1 plus RFC 2047.
3. A few header values which are URIs (no specified encoding) or IRIs
(UTF-8).
I understand the desire to decode ASAP, and I agree with Guido that we
should use a default encoding which the app can override. Looking at the
above, ISO-8859-1 is the best encoding I know of for all three header
cases. ASCII can be used as a valid subset without transcoding; headers
which are ISO-8859-1 are decoded perfectly; and URI/IRI headers can be
passed through opaquely by middleware without mangling, and transcoded
by the app if needed.
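
To make that concrete, a small sketch (mine, not from the thread) of why
ISO-8859-1 is safe for all three cases: the decode always succeeds and
always round-trips back to the original bytes.

ascii_value = b"text/html"                # case 1: plain ASCII
latin1_value = b"caf\xe9"                 # case 2: TEXT in ISO-8859-1
uri_value = "/caf\u00e9".encode("utf-8")  # case 3: UTF-8 bytes from an IRI

for raw in (ascii_value, latin1_value, uri_value):
    as_str = raw.decode("iso-8859-1")           # never raises
    assert as_str.encode("iso-8859-1") == raw   # lossless round-trip

# The app can recover the IRI text whenever it wants:
assert uri_value.decode("iso-8859-1").encode("iso-8859-1").decode("utf-8") == "/caf\u00e9"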
If we make *that* call, then IMO there's no reason not to do the same to
SCRIPT_NAME, PATH_INFO, and QUERY_STRING.
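
For example, an app (or middleware) could re-decode PATH_INFO like
this--purely a sketch, assuming the server decoded the raw bytes as
ISO-8859-1; the helper name is hypothetical:

def get_path(environ, preferred="utf-8"):
    """Recover PATH_INFO as text, assuming a server-side ISO-8859-1 decode."""
    raw = environ.get("PATH_INFO", "").encode("iso-8859-1")  # back to original bytes
    try:
        return raw.decode(preferred)
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")  # fall back to the server's default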
Robert Brewer
fumanchu at aminus.org
[1]
http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p1-messaging-06.txt