[Web-SIG] WSGI for Python 3
Tres Seaver
tseaver at palladion.com
Fri Jul 16 23:47:40 CEST 2010
Ian Bicking wrote:
>> IOW, the bytes/string discussion on Python-dev has kind of led me to
>> realize that we might just as well make the *entire* stack bytes (incoming
>> and outgoing headers *and* streams), and rewrite that bit in PEP 333 about
>> using str on "Python 3000" to say we go with bytes on Python 3+ for
>> everything that's a str in today's WSGI.
>>
>
> This was my first intuition too, until I started thinking in more detail
> about the particular values involved. Some obviously are textish, like
> environ['SERVER_NAME']. Not a very useful value, but definitely text.
>
> Basically all the internal strings are textish, so we're left with:
What do you mean by "internal"? Anything in the headers or the CGI
environment is intrinsically "bytes-ish" to me. Do you mean that you
want application programmers to have them transparently decoded? If so,
we can make that the responsibility of the non-middleware framework /
application.
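A minimal sketch of what that could look like at the framework / application layer, assuming a bytes-valued environ. The helper name decode_environ and the UTF-8 default are hypothetical and not part of any WSGI spec:

```python
def decode_environ(environ, encoding="utf-8", errors="replace"):
    """Return a copy of environ with every bytes value decoded to text.

    Hypothetical helper: the server and middleware would pass bytes
    through untouched, and only the outermost framework or application
    would pick an encoding.
    """
    decoded = {}
    for key, value in environ.items():
        if isinstance(value, bytes):
            value = value.decode(encoding, errors)
        # Non-bytes values (e.g. stream objects) are passed through as-is.
        decoded[key] = value
    return decoded


# Example environ as a bytes-only server might hand it over (made-up values).
raw_environ = {
    "SERVER_NAME": b"example.com",
    "PATH_INFO": b"/caf\xc3\xa9",
    "CONTENT_LENGTH": b"0",
}
text_environ = decode_environ(raw_environ)
```

Middleware would keep working on raw_environ; only code that wants text would call the helper, and it could choose a different encoding per deployment.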
> wsgi.url_scheme
> SCRIPT_NAME/PATH_INFO
> QUERY_STRING
> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
> response status
> response headers (name and value)
>
> And there's a few things like REMOTE_USER that are kind of in the middle.
> Everyone is in agreement that bodies should be bytes.
>
> One initial problem is that the Python 3 stdlib handles bytes poorly, so for
> instance there's no good way to reconstruct the URL using the stdlib. That
> explains certain tensions, but I think we should ignore that, and in fact
> that's what Python-Dev seemed to say pretty clearly.
python-dev seems to me to be coming to the realization that they should
have tried harder to make real-world apps work before they froze their
choices.
> Now, the other keys:
>
> wsgi.url_scheme: clearly ASCII
>
> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old
> legacy encoding.
> raw request path: should be ASCII (non-ASCII should be URL-encoded). URL
> encoding happens at the byte layer, so a server could reasonably URL encode
> any non-ASCII characters without imposing any encoding.
>
> QUERY_STRING: should be ASCII, same as raw request path
>
> headers: Most are ASCII. Latin1 is a reasonable fallback and suggested by
> the specification. The spec also implies you have to use the RFC2047 inline
> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and
> supporting it would probably be a bad idea for security reasons. The
> Atompub spec (reasonably modern) specifically says Title headers should be
> encoded with RFC2047 (if they are not ISO-8859-1):
> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 --
> decoding this kind of encoding at the application layer seems reasonable to
> me.
>
> cookie header: this specific header can easily have multiple encodings, as
> the browser encodes data then treats it as opaque bytes, so a cookie can be
> set via UTF-8 one place, Latin1 another, and those coexist in one header.
> That is, there is no real encoding and this should be treated as bytes.
> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but
> entirely workable.)
>
> response status: I believe the spec says this must be Latin1/ISO-8859-1. In
> practice it is almost always ASCII, and since it is not user-visible it's
> not something that really needs localization.
>
> response headers: the spec implies Latin1, in practice the Set-Cookie header
> is bytes (since interoperation with wonky legacy systems is not uncommon).
> I'm not sure of any other exceptions?
>
>
> So... to me it seems pretty reasonable for HTTP specifically that text can
> work. And it feels weird that, say, environ['SERVER_NAME'] be text and
> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR']
> should be in that mode. And it would also be weird if
> environ['SERVER_NAME'] was bytes.
> In the past when we've gotten down to specifics, the only holdup has been
> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.
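Three of the byte-level points in the quoted list can be checked with the Python 3 stdlib. This is only an illustrative sketch: the sample values and the helper names latin1_roundtrip and decode_rfc2047 are made up for the example.

```python
from email.header import decode_header
from urllib.parse import quote, unquote_to_bytes


def latin1_roundtrip(raw):
    """Decode bytes as Latin-1 and encode them back again."""
    return raw.decode("latin-1").encode("latin-1")


# A cookie header mixing a UTF-8 value and a Latin-1 value. Latin-1 maps
# every byte to exactly one code point, so nothing is lost on the way back.
mixed_cookie = b"a=caf\xc3\xa9; b=caf\xe9"
roundtripped = latin1_roundtrip(mixed_cookie)

# URL encoding works on bytes, so a path can be made ASCII without
# knowing (or imposing) the encoding of the original bytes.
raw_path = b"/caf\xc3\xa9"
ascii_path = quote(raw_path)
restored_path = unquote_to_bytes(ascii_path)


def decode_rfc2047(value):
    """Decode RFC 2047 encoded words in a header value at the app layer."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", "replace")
        parts.append(chunk)
    return "".join(parts)


title = decode_rfc2047("=?iso-8859-1?q?some=20text?=")
```

Note that the Latin-1 round trip preserves the bytes but not the meaning: the decoded text for the UTF-8 cookie value is mojibake until the application re-encodes it and decodes it with the right codec.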
I think I favor PJE's suggestion: let WSGI deal only in bytes.
Tres.
--
===================================================================
Tres Seaver +1 540-429-0999 tseaver at palladion.com
Palladion Software "Excellence by Design" http://palladion.com