[Web-SIG] WSGI input filter that changes content length.

Mon Jan 15 13:47:24 CET 2007

[Graham]
> Hmmm, maybe I should have phrased my question a bit differently as to be
> honest I am not actually interested in doing on the fly decompression and
> only used it as an example. I really only want to know about how the
> content length is supposed to be dealt with. I didn't want to explain the
> actual context for the question as didn't want to let on yet to what I am up
> to, so used an example which I thought would illustrate the problem.

Point taken. But I think gzip encoding is a good example to illustrate
the issues.

[Graham]
> If I leave the
> content length header as is and any application does a
> read(content_length) and decompression or some other input filter
> actually results in more data than that being available, the application
> will not get it all as it has only asked to read the original length
> before decompression.

So obviously the Content-Length header cannot be left unmodified if
some transformation is in place that is altering the length of the
content.

There are two choices for how the wrapping should happen.

1. The ungzipping filter reads the entirety of the (possibly huge)
input, decompresses it, and makes it available in wsgi.input. The
Content-Length header is rewritten to reflect the length of the
decompressed content. The application then sees a valid Content-Length
value, but the server has had to buffer a potentially large input
stream in order to be able to provide that.

2. The ungzipping filter wraps the compressed stream, and decompresses
on demand and on-the-fly. In this case, it *must* delete the old
Content-Length header, which is now invalid. It cannot provide a new
value for Content-Length, since the final uncompressed length of the
input stream cannot be known until all of it has been decompressed.
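
To make the two choices concrete, here is a rough sketch of each as
WSGI middleware. The names GunzipStream, buffering_gunzip and
streaming_gunzip are made up for illustration, and the sketch assumes
the request arrives with Content-Encoding: gzip (HTTP_CONTENT_ENCODING
in the environ). It is not a reference implementation.

```python
import gzip
import io
import zlib


class GunzipStream:
    """Hypothetical file-like wrapper: decompresses gzip data as it is read."""

    def __init__(self, raw, compressed_length=None, chunk_size=8192):
        self._raw = raw
        self._remaining = compressed_length  # None means "unknown"
        self._chunk_size = chunk_size
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header/trailer.
        self._decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._buffer = b""
        self._eof = False

    def _read_raw(self):
        # Never ask the underlying stream for more than the original length.
        n = self._chunk_size
        if self._remaining is not None:
            n = min(n, self._remaining)
            if n == 0:
                return b""
        chunk = self._raw.read(n)
        if self._remaining is not None:
            self._remaining -= len(chunk)
        return chunk

    def read(self, size=-1):
        read_all = size is None or size < 0
        # Pull compressed chunks until the request is satisfied or input ends.
        while not self._eof and (read_all or len(self._buffer) < size):
            chunk = self._read_raw()
            if chunk:
                self._buffer += self._decomp.decompress(chunk)
            else:
                self._buffer += self._decomp.flush()
                self._eof = True
        if read_all:
            size = len(self._buffer)
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


def buffering_gunzip(app):
    """Choice 1: decompress everything up front and rewrite CONTENT_LENGTH."""
    def wrapper(environ, start_response):
        if environ.get("HTTP_CONTENT_ENCODING") == "gzip":
            length = int(environ.get("CONTENT_LENGTH") or 0)
            body = gzip.decompress(environ["wsgi.input"].read(length))
            environ["wsgi.input"] = io.BytesIO(body)
            environ["CONTENT_LENGTH"] = str(len(body))
            del environ["HTTP_CONTENT_ENCODING"]
        return app(environ, start_response)
    return wrapper


def streaming_gunzip(app):
    """Choice 2: decompress on demand and delete the stale CONTENT_LENGTH."""
    def wrapper(environ, start_response):
        if environ.get("HTTP_CONTENT_ENCODING") == "gzip":
            length = environ.pop("CONTENT_LENGTH", None)
            environ["wsgi.input"] = GunzipStream(
                environ["wsgi.input"], int(length) if length else None)
            del environ["HTTP_CONTENT_ENCODING"]
        return app(environ, start_response)
    return wrapper
```

In the streaming variant the original length is still used internally
to bound reads on the compressed stream, but it is never shown to the
application, which only ever sees decompressed data.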

[Graham]
> The PEP says that an application though should not attempt to read more
> data than has been specified by the content length. If it is common
> practice that applications take this literally and always get data from
> the input by using read(content_length) then there is a requirement that
> the content length header must exist. Thus, if the input filter does zap
> the content length header and remove it then an application which does
> that will not work.

Then I suppose that that application is not a fully-compliant WSGI application.

Scenario 2 outlined above is a perfectly valid scenario that can
happen, so an application that cannot deal with that scenario is not
robust.

> Thus the question probably is, what is accepted practice or what does
> the PEP dictate as to how applications should use read()?

AFAICT, the PEP is not prescriptive about the use of the
wsgi.input.read() method.

However, given that you have found it necessary to raise the question,
perhaps it should be added to the WSGI PEP that absence of a
Content-Length header does NOT imply absence of content.
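
On the application side, a sketch of what "not assuming zero" might
look like is below. The helper name read_body is hypothetical, and the
sketch assumes that the server (or an input filter) signals the end of
input by returning an empty string from read(), which not every server
necessarily does.

```python
import io


def read_body(environ, chunk_size=8192):
    """Return the request body whether or not CONTENT_LENGTH is present."""
    stream = environ["wsgi.input"]
    length = environ.get("CONTENT_LENGTH")
    if length:
        # Present and non-empty: read exactly that many bytes.
        return stream.read(int(length))
    # Absent or empty: do not assume zero; read until the stream reports EOF.
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

An application written this way keeps working when an input filter
has deleted the Content-Length header.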

[Graham]
> So, is it okay to remove the content length header when there is actually
> data and I know it wouldn't actually be correct,

I would say it's compulsory to remove the header: it contains an
incorrect value, and if the application uses that value, it will get
unexpected data or an exception, and rightly so.

[Graham]
> or does that result in a
> situation that is seen as violating the PEP or even if acceptable would break
> existing WSGI applications.

I would say that leaving an incorrect value in place should be a
violation of the PEP.

> Or in short, is it mandatory that content length header must exist if there is
> non zero length data in input? I know the PEP says that the content length
> may be empty or absent, but am concerned that applications would assume
> it has value of 0 if empty or absent.

No, the Content-Length header is optional, and any applications that
operate otherwise are non-compliant.

[Alan]
>> 6. Exactly the same principles should apply to decoding incoming
>> Transfer-Encoding: chunked.

[Graham]
> My understanding is that content encoding is different to transfer encoding,
> ie., is not hop by hop in this sense and that the same statements don't apply.

A hop-by-hop header is one where the attribute described in the header
is not an inherent attribute of the content being transferred, but is
used solely in one stage of a multi-hop communication.

If my browser is using a proxy, which relays requests on to a server,
the proxy may decide to use Transfer-Encoding to communicate with the
server. Thus the Transfer-Encoding only applies to the proxy->server
"hop". If the server receives such a Transfer-Encoding, it *must*
decode the content according to that Transfer-Encoding before making
it available to the application.
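
For illustration, decoding a chunked body on the server side could
look roughly like the sketch below. The function name decode_chunked
is made up, it buffers the whole body for simplicity, and it does no
error handling for malformed input, so treat it as a sketch of the
wire format rather than what any particular server does.

```python
import io


def decode_chunked(stream):
    """Decode an HTTP/1.1 chunked body read from a binary file-like object."""
    body = []
    while True:
        size_line = stream.readline()
        # The chunk size is hexadecimal; anything after ';' is an extension.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break  # a zero-sized chunk marks the end of the body
        body.append(stream.read(size))
        stream.readline()  # consume the CRLF that terminates the chunk data
    # Skip optional trailer fields up to the blank line that ends the message.
    while stream.readline() not in (b"\r\n", b"\n", b""):
        pass
    return b"".join(body)
```

The server would then hand the decoded bytes to the application via
wsgi.input, without the application ever seeing the chunk framing.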

[Graham]
> Wait till you see what I am about to come out with if I can sort this issue out. :-)

Now I'm intrigued :-)

Regards,

Alan.