[Web-SIG] WSGI input filter that changes content length.

Mon Jan 15 12:49:49 CET 2007

Alan Kennedy wrote ..
> [Graham Dumpleton]
> > How does one implement in WSGI an input filter that manipulates the request
> > body in such a way that the effective content length would be changed?
> 
> > The problem I am trying to address here is how one might implement using
> > WSGI a decompression filter for the body of a request. Ie., where
> > "Content-Encoding: gzip" has been specified.
> 
> > So, how is one meant to deal with this in WSGI?
> 
> The usual approach to modifying something in the WSGI
> environment, in this case the wsgi.input file-like object, is to wrap
> it or replace it with an object that behaves as desired.
> 
> In this case, the approach I would take would be to wrap the
> wsgi.input object with a gzip.GzipFile object, which should only read
> the input stream data on demand. The code would look like this
> 
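> import gzip
> # GzipFile's first positional argument is a filename, so pass the stream as fileobj.
> wsgi_env['wsgi.input'] = gzip.GzipFile(fileobj=wsgi_env['wsgi.input'], mode='rb')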
> 
> Notes.
> 
> 1. The application should be completely unaware that it is dealing
> with a compressed stream: it simply reads from wsgi.input, unaware
> that reading from what it thinks the input stream is actually causing
> cascading reads down a series of file-like objects.
> 
> 2. The GzipFile object will decompress on the fly, meaning that it
> will only read from the wrapped input stream when it needs input.
> Which means that if the application does not read data from
> wsgi.input, then no data will be read from the client connection.

Hmmm, maybe I should have phrased my question a bit differently. To be
honest, I am not actually interested in doing on the fly decompression and
only used it as an example. I really only want to know how the
content length is supposed to be dealt with. I didn't want to explain the
actual context for the question, as I didn't want to let on yet what I am up
to, so I used an example which I thought would illustrate the problem.

> 3. The GzipFile should not be responsible for enforcement of the
> incoming Content-Length boundary. Instead, this should be enforced by
> the original server-provided file-like input stream that it wraps. So
> if the application attempts to read past Content-Length bytes, the
> server-provided input stream "is allowed to simulate an end-of-file
> condition". Which would cause the GzipFile to return an EOF to the
> application, or possibly an exception.
>
> 4. Because of the on-the-fly nature of the GzipFile decompression, it
> would not be possible to provide a meaningful Content-Length value to
> the application. To do so would require buffering and decompressing
> the entire input data stream. But the application should still be able
> to operate without knowing Content-Length.

I am not sure this fully answers what I want to know. If I leave the
content length header as is and any application does a
read(content_length) and decompression or some other input filter
actually results in more data than that being available, the application
will not get it all as it has only asked to read the original length
before decompression.
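
To make the problem concrete, here is a small sketch of what happens when the
header is left alone. The payload and the environ dictionary are made up for
illustration, and the code is written for a current Python rather than
anything a particular server provides.

```python
import gzip
import io

# Hypothetical request body and its compressed form as sent by the client.
original = b"x" * 10000
compressed = gzip.compress(original)

# CONTENT_LENGTH still describes the compressed bytes on the wire, but
# wsgi.input has been replaced by a filter that yields decompressed data.
environ = {
    "CONTENT_LENGTH": str(len(compressed)),
    "wsgi.input": gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb"),
}

# An application that trusts the header asks only for the compressed length.
content_length = int(environ["CONTENT_LENGTH"])
body = environ["wsgi.input"].read(content_length)

# The application received only part of the decompressed data.
truncated = len(body) < len(original)
```

Here the repetitive payload compresses to far fewer than 10000 bytes, so the
application sees only the first part of the decompressed data.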

The PEP says, though, that an application should not attempt to read more
data than has been specified by the content length. If it is common
practice that applications take this literally and always get data from
the input by using read(content_length), then there is a requirement that
the content length header must exist. Thus, if the input filter does zap
the content length header, then an application which does that will
not work.

Thus the question probably is, what is accepted practice or what does
the PEP dictate as to how applications should use read()?

Is it accepted as the norm that applications will always do
read(content_length), so that zapping the content length is
unacceptable? Or, where an application doesn't need to know the
content length up front (it would need it, for example, to pass it
downstream like the proxy in paste), would applications always just use
read() with no argument and get all the data, or at worst read it in
chunks until read() returns an empty string? BTW, yes I know that they
could use readline(), readlines() or __iter__(), but let's look at this just
in terms of read() for now.
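
The chunked style of reading would be something like the sketch below.
read_all and the chunk size are made up for illustration, and the loop assumes
the stream signals end of input by returning an empty result, which is exactly
the behaviour being asked about rather than something established here.

```python
import io


def read_all(stream, chunk_size=8192):
    """Read a file-like object in chunks until read() returns nothing.

    Hypothetical helper; it never looks at CONTENT_LENGTH.
    """
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # Empty result: the stream reported end of input.
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Stand-in for wsgi.input so the sketch can be run on its own.
body = read_all(io.BytesIO(b"a" * 20000))
```

An application written this way would not care whether an input filter had
changed the amount of data available.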

So, is it okay to remove the content length header when there is actually
data and I know the header wouldn't be correct? Or does that result in a
situation that is seen as violating the PEP, or that, even if acceptable, would
break existing WSGI applications?

Or in short, is it mandatory that the content length header must exist if there
is non-zero-length data in the input? I know the PEP says that the content length
may be empty or absent, but I am concerned that applications would assume
it has a value of 0 if empty or absent.
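
The distinction I am worried about can be put as a small sketch.
get_content_length is a hypothetical helper, not something the PEP defines; it
keeps "no length given" separate from "length is zero" instead of folding both
into 0.

```python
def get_content_length(environ):
    """Return the declared length as an int, or None if empty or absent.

    Hypothetical helper: None means "unknown", which is not the same as 0.
    """
    value = environ.get("CONTENT_LENGTH", "")
    if not value.strip():
        return None
    return int(value)


# An application that instead does int(environ.get("CONTENT_LENGTH") or 0)
# would treat a missing header as an empty body and read nothing.
unknown = get_content_length({})
empty_body = get_content_length({"CONTENT_LENGTH": "0"})
```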

> 5. The wrapping can NOT be done in middleware. PEP 333, Section "Other
> HTTP Features" has this to say: "WSGI applications must not generate
> any "hop-by-hop" headers [4], attempt to use HTTP features that would
> require them to generate such headers, or rely on the content of any
> incoming "hop-by-hop" headers in the environ dictionary. WSGI servers
> must handle any supported inbound "hop-by-hop" headers on their own,
> such as by decoding any inbound Transfer-Encoding, including chunked
> encoding if applicable." So the wrapping and replacement of wsgi.input
> should happen in the server or gateway, NOT in middleware.
> 
> 6. Exactly the same principles should apply to decoding incoming
> Transfer-Encoding: chunked.

My understanding is that content encoding is different to transfer encoding,
ie., it is not hop by hop in this sense, and so the same statements don't apply.
I could well be wrong though. But even if this is the case, the underlying
server itself may not be able to guarantee that the content length header
is valid if it is doing the decompression using its own filter. Thus, it
may itself want to zap the content length header before anything is even
handed off to a WSGI stack. Therefore at the very outset the root application
may get no content length header even though there is still data to read. I know
this may cause issues if an application checks for a content length header and,
if not found, raises the HTTP error response indicating that length is required.
But ignoring that, if in general the content length header must always exist in
order to use read(), then it effectively means that an underlying web server can
never use any input filters of its own which would change such things as the
content length of data.

If this is the case, then it seems that a WSGI adapter for a specific web server,
if it can detect that the web server is going to apply a filter of its own which
is going to change the content length, should possibly respond with some sort
of error before it even hands the request off to the WSGI stack, so as to avoid
problems.

In other words, the adapter should flag a configuration issue with an error
to cause the server admin to ensure that all web server input filters are disabled
for URLs that are being passed through to a WSGI application. Ie., leave everything
up to WSGI and not try to do things itself. But then if one does leave everything
up to WSGI, how can WSGI implement decompression itself, and will zapping the
content length cause failure of existing applications? Thus back to my original
question.

Hope you can follow what I am going on about.

> P.S. Thanks for all your great work on mod_python Graham!

Wait till you see what I am about to come out with if I can sort this issue out. :-)

Graham