[Web-SIG] WSGI for Python 3

Sat Jul 17 08:26:37 CEST 2010

On Saturday, July 17, 2010, Graham Dumpleton <graham.dumpleton at gmail.com> wrote:
> On Saturday, July 17, 2010, Ian Bicking <ianb at colorstudy.com> wrote:
>> On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby <pje at telecommunity.com> wrote:
>>
>>
>> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>>
>> And this doesn't help with Python 3: either we have byte values of SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think bytes will be more awkward to port to than text, and inconsistent with other WSGI values.
>>
>>
>> OTOH, it has the tremendous advantage of pushing the encoding question onto the app (or framework) developer...  who's really the only one who can make the right decision for their particular application.  And personally, I'd rather have clear boundaries between text and bytes, such that porting (even if tedious or awkward) is *consistent*, and clear as to when you're finished, not, "oh, did I check to make sure I converted SCRIPT_NAME and PATH_INFO...  not just in my app code, but in all the library code I call *from* my app?"
>>
>> IOW, the bytes/string discussion on Python-dev has kind of led me to realize that we might just as well make the *entire* stack bytes (incoming and outgoing headers *and* streams), and rewrite that bit in PEP 333 about using str on "Python 3000" to say we go with bytes on Python 3+ for everything that's a str in today's WSGI.
>>
>> This was my first intuition too, until I started thinking in more detail about the particular values involved.  Some obviously are textish, like environ['SERVER_NAME'].  Not a very useful value, but definitely text.
>>
>> Basically all the internal strings are textish, so we're left with:
>>
>> wsgi.url_scheme
>> SCRIPT_NAME/PATH_INFO
>> QUERY_STRING
>> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
>> response status
>> response headers (name and value)
>>
>> And there's a few things like REMOTE_USER that are kind of in the middle.  Everyone is in agreement that bodies should be bytes.
>>
>> One initial problem is that the Python 3 stdlib handles bytes poorly, so for instance there's no good way to reconstruct the URL using the stdlib.  That explains certain tensions, but I think we should ignore that, and in fact that's what Python-Dev seemed to say pretty clearly.
>>
>> Now, the other keys:
>>
>> wsgi.url_scheme: clearly ASCII
>>
>> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old legacy encoding.
>> raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL encoding happens at the byte layer, so a server could reasonably URL encode any non-ASCII characters without imposing any  encoding.
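To illustrate the point about URL encoding working at the byte layer, here is a minimal sketch using the Python 3 stdlib. The sample path is made up for illustration and deliberately mixes a Latin-1 byte with a UTF-8 sequence, so no single encoding describes it.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Hypothetical raw path: one Latin-1 byte (\xe9) and one UTF-8 sequence (\xc3\xa9).
raw_path = b"/caf\xe9/\xc3\xa9"

# Percent-encode every non-ASCII byte; no character encoding is assumed.
encoded = quote_from_bytes(raw_path)

# The result is pure ASCII text and decodes back to the identical bytes.
assert encoded.isascii()
assert unquote_to_bytes(encoded) == raw_path
```

The encoded form is safe to carry as text, and the application can still recover the exact bytes and pick its own decoding.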
>>
>> QUERY_STRING: should be ASCII, same as raw request path
>>
>> headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by the specification.  The spec also implies you have to use the RFC2047 inline encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and supporting it would probably be a bad idea for security reasons.  The Atompub spec (reasonably modern) specifically says Title headers should be encoded with RFC2047 (if they are not ISO-8859-1): http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 -- decoding this kind of encoding at the application layer seems reasonable to me.
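As a rough sketch of what decoding RFC2047 at the application layer could look like, the stdlib's email.header.decode_header can be applied to a header value the application has chosen to treat this way. The helper name decode_rfc2047 and the Latin1 fallback are assumptions for illustration, not part of any WSGI proposal.

```python
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Decode RFC 2047 encoded-words in a header value (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Assumption: fall back to Latin-1 when no charset is declared.
            parts.append(chunk.decode(charset or "latin-1"))
        else:
            # Values without encoded-words come back as plain str.
            parts.append(chunk)
    return "".join(parts)


title = decode_rfc2047("=?iso-8859-1?q?some=20text?=")
```

An application would call this only on headers it expects to carry encoded-words, such as the Atompub title header, rather than on every incoming header.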
>>
>> cookie header: this specific header can easily have multiple encodings, as the browser encodes data then treats it as opaque bytes, so a cookie can be set via UTF-8 one place, Latin1 another, and those coexist in one header.  That is, there is no real encoding and this should be treated as bytes.  (Latin1 is an approximation of bytes... a spotty way to treat bytes, but entirely workable.)
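The "Latin1 as an approximation of bytes" idea can be sketched as follows: every byte value maps to exactly one Latin1 code point, so a text value produced that way can always be turned back into the original bytes. The cookie names and values here are invented for illustration.

```python
def cookie_bytes(text_value: str) -> bytes:
    """Recover the original bytes from a Latin-1-decoded header value."""
    return text_value.encode("latin-1")


# Hypothetical Cookie header: 'old' was set as Latin-1, 'new' as UTF-8.
raw_header = b"old=caf\xe9; new=caf\xc3\xa9"

# What a text-based server could hand to the application.
as_text = raw_header.decode("latin-1")
assert cookie_bytes(as_text) == raw_header  # lossless round trip

# The application picks the right encoding per cookie.
cookies = dict(item.split("=", 1) for item in as_text.split("; "))
old_value = cookie_bytes(cookies["old"]).decode("latin-1")
new_value = cookie_bytes(cookies["new"]).decode("utf-8")
```

Both values come out as the same text even though they arrived in different encodings within one header.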
>>
>> response status: I believe the spec says this must be Latin1/ISO-8859-1.  In practice it is almost always ASCII, and since it is not user-visible it's not something that really needs localization.
>>
>> response headers: the spec implies Latin1, in practice the Set-Cookie header is bytes (since interoperation with wonky legacy systems is not uncommon).  I'm not sure of any other exceptions?
>>
>>
>> So... to me it seems pretty reasonable for HTTP specifically that text can work.  And it feels weird that, say, environ['SERVER_NAME'] be text and environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR'] should be in that mode.  And it would also be weird if environ['SERVER_NAME'] was bytes.
>>
>> In the past when we've gotten down to specifics, the only holdup has been SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.
>
> There were a few other weird ones, though, which are server specific.
> For example PATH_TRANSLATED (??). These are ones where again the
> server or operating system dictates the encoding due to them having
> bits in them deriving from things like filesystem paths and server
> configuration files. I laboriously went through all these in an email
> last year or earlier.
>
> Same reason why SCRIPT_NAME is really dictated by server and raw value
> perhaps should be going through to application.

s/should/shouldn't/

Graham