A more Twisted approach to async apps in WSGI
Hi all. I've been away for a few days due to loss of e-mail service when my dedicated server lost a hard drive. Unfortunately my ISP didn't support the OS version any more, so I had to rebuild everything for the new OS version.
Anyway, on to the topic of my post. Should 'wsgi.input' become an iterator? Or should we develop a different API for asynchronous applications?
On the positive side of the iterator approach, it could make it easier for asynchronous applications to pause waiting for input, and it could in principle support "chunked" transfer encoding of the input stream.
However, since we last discussed this, I did some Googling on CGI and chunked encoding. By far and away, the most popular links regarding chunked encoding and CGI, are all about bugs in IIS and Apache leading to various vulnerabilities when chunked encoding is used. :(
Once you get past those items (e.g. by adding "-IIS -vulnerability" to your search), you then find *our* discussion here on the Web-SIG! Finally, digging further, I found some 1998 discussion from the IPP (Internet Printing Protocol!) mailing list about what HTTP/1.1 servers support chunked encoding for CGI and which don't.
Anyway, the long and short of it is that CGI and chunked encoding are quite simply incompatible, which means that relying on its availability would be nonportable in a WSGI application anyway.
That leaves the asynchronous use case, but the benefit is rather strained at that point. Many frameworks reuse the 'cgi' module's 'FieldStorage' class in order to parse browser input, and the 'cgi' module's implementation requires an object with a 'readline()' method. That means that if we switch from an input stream to an iterator, a lot of people are going to be trying to make sensible wrappers to convert the iterator back to an input stream, and that's just getting ridiculous, especially since in many cases the server or gateway has a file-like object to start with.
So, I'm thinking we should shift the burden to an async-specific API. But, in this case, "burden" means that we get to give asynchronous apps an API much more suited to their use cases. Suppose that we did something similar to 'wsgi.file_wrapper'? That is, suppose we had an optional extension that a server could provide, to wrap specialized application object(s) in a fashion that then provides backward compatibility to the spec? That is, suppose we had a 'wsgi.async_wrapper', used like this:
    if 'wsgi.async_wrapper' in environ:
        controller = environ['wsgi.async_wrapper'](environ)
        # do stuff with controller, like register its
        # methods as callbacks
        return controller
The idea is that this would create an iterator that the server/gateway could recognize as "special", similar to the file-wrapper trick. But, the object returned would provide an extra API for use by the asynchronous application, maybe something like:
    put(data) -- queue data for retrieval when the controller is iterated over
    finish() -- mark the iterator finished, so it raises StopIteration
    on_get(length,callback) -- call 'callback(data)' when 'length' bytes are available on 'wsgi.input' (but return immediately from the 'on_get()' call)
While this API is an optional extension, it seems it would be closer to what some async fans wanted, and less of a kludge. It won't do away with the possibility that middleware might block waiting for input, of course, but when no middleware is present or the middleware isn't transforming the input stream, it should work out quite well.
In any case, the implementation of the methods and the iterator interface are pretty straightforward, either for synchronous or asynchronous servers.
What do y'all think? I'd especially like feedback from Twisted folk, as to whether this looks anything like the right kind of API for async apps. (I expect it will need some tweaking and tuning.)
But if this is the overall right approach, I'd like to drop the current proposals to make 'wsgi.input' an iterator and add optional 'pause'/'resume' APIs, since they were rather kludgy compared to giving async apps their own mini-API for nonblocking I/O. Comments? Questions?
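To make the proposed 'wsgi.async_wrapper' extension concrete, here is a minimal sketch of what such a controller might look like. It is only an illustration under stated assumptions, not an implementation from any server: the class name `AsyncController`, the `feed_input()` hook the server would call, and the choice to yield an empty byte string when nothing is queued are all hypothetical, and it uses Python 3 byte strings where the thread writes ''.

```python
class AsyncController:
    """Hypothetical object returned by environ['wsgi.async_wrapper'](environ)."""

    def __init__(self, environ):
        self.environ = environ
        self._out = []          # output chunks queued by put()
        self._finished = False  # set by finish()
        self._inbuf = b""       # request-body bytes received so far
        self._pending = []      # (length, callback) pairs from on_get()

    def put(self, data):
        """Queue data for retrieval when the controller is iterated over."""
        if self._finished:
            raise RuntimeError("put() called after finish()")
        self._out.append(data)

    def finish(self):
        """Mark the iterator finished, so it raises StopIteration."""
        self._finished = True

    def on_get(self, length, callback):
        """Register a one-shot callback and return immediately."""
        self._pending.append((length, callback))

    def feed_input(self, data):
        """Assumed server-side hook: called when request-body bytes arrive."""
        self._inbuf += data
        # Fire callbacks whose requested length is now available.
        while self._pending and len(self._inbuf) >= self._pending[0][0]:
            length, callback = self._pending.pop(0)
            chunk, self._inbuf = self._inbuf[:length], self._inbuf[length:]
            callback(chunk)

    def __iter__(self):
        return self

    def __next__(self):
        if self._out:
            return self._out.pop(0)
        if self._finished:
            raise StopIteration
        return b""  # assumption: nothing ready yet, so yield an empty chunk


def app(environ, start_response):
    """Example application that upper-cases a five-byte request body."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    if "wsgi.async_wrapper" in environ:
        controller = environ["wsgi.async_wrapper"](environ)

        def got_body(data):
            controller.put(data.upper())
            controller.finish()

        controller.on_get(5, got_body)
        return controller
    return [b"no async support"]
```

A server that does not offer the extension simply omits the environ key, and the application falls back to an ordinary iterable.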
"Phillip J. Eby" <pje@telecommunity.com> writes:
Hi all. I've been away for a few days due to loss of e-mail service when my dedicated server lost a hard drive. Unfortunately my ISP didn't support the OS version any more, so I had to rebuild everything for the new OS version.
Anyway, on to the topic of my post. Should 'wsgi.input' become an iterator? Or should we develop a different API for asynchronous applications?
On the positive side of the iterator approach, it could make it easier for asynchronous applications to pause waiting for input, and it could in principle support "chunked" transfer encoding of the input stream.
However, since we last discussed this, I did some Googling on CGI and chunked encoding. By far and away, the most popular links regarding chunked encoding and CGI, are all about bugs in IIS and Apache leading to various vulnerabilities when chunked encoding is used. :(
Once you get past those items (e.g. by adding "-IIS -vulnerability" to your search), you then find *our* discussion here on the Web-SIG! Finally, digging further, I found some 1998 discussion from the IPP (Internet Printing Protocol!) mailing list about what HTTP/1.1 servers support chunked encoding for CGI and which don't.
Anyway, the long and short of it is that CGI and chunked encoding are quite simply incompatible, which means that relying on its availability would be nonportable in a WSGI application anyway.
I don't understand the problem with an iterator on CGI. A CGI script is by definition multi-process. If one script blocks, a new script will be run, and the first client will wait anyway. If nothing blocks, having an iterator or not changes nothing for that client. It will be up to the server to decide whether it can use chunked encoding or not. If the script blocks and doesn't use chunked encoding, it will not be possible to run the script under CGI anyway... I know people who use chunked encoding in CGI; they know what they are doing and it works fine, and I'm sure they will use the iterator. I don't see the difference between
    [sleep...] [sleep...] [sleep...] return data
and
    [sleep...] yield [sleep...] yield [sleep...] yield
for a CGI script if it's not possible to avoid sleeping.
-- William Dodé - http://flibuste.net
At 10:33 AM 9/23/04 +0200, William Dode wrote:
I don't see the difference between
[sleep...] [sleep...] [sleep...] return data
and
[sleep...] yield [sleep...] yield [sleep...] yield
for a CGI script if it's not possible to avoid sleeping.
As previously discussed, the existence of an asynchronous API only matters for asynchronous servers and gateways.
A bit late with the response...but better late than never I hope. ;)
On Sep 22, 2004, at 9:56 PM, Phillip J. Eby wrote:
On the positive side of the iterator approach, it could make it easier for asynchronous applications to pause waiting for input, and it could in principle support "chunked" transfer encoding of the input stream.
Anyway, the long and short of it is that CGI and chunked encoding are quite simply incompatible, which means that relying on its availability would be nonportable in a WSGI application anyway.
I do not find that a good reason to copy the mistake (not supporting chunking) to a new API. However! I don't think that the file-like-object API even has a problem with chunked incoming data. As long as WSGI does not make CONTENT_LENGTH a required header, and as long as the result of read looks different for "more data still to come" and "data finished" (it does, blocking for more data to occur vs. returning ''), I think it should be fine (for non-async apps). Am I missing something here?
[...] That means that if we switch from an input stream to an iterator, a lot of people are going to be trying to make sensible wrappers to convert the iterator back to an input stream, and that's just getting ridiculous, [...]
An iterable input stream does seem like it may be a loser for the common case.
So, I'm thinking we should shift the burden to an async-specific API. But, in this case, "burden" means that we get to give asynchronous apps an API much more suited to their use cases. [...] The idea is that this would create an iterator that the server/gateway could recognize as "special", similar to the file-wrapper trick. But, the object returned would provide an extra API for use by the asynchronous application, maybe something like:
put(data) -- queue data for retrieval when the controller is iterated over
finish() -- mark the iterator finished, so it raises StopIteration
on_get(length,callback) -- call 'callback(data)' when 'length' bytes are available on 'wsgi.input' (but return immediately from the 'on_get()' call)
While this API is an optional extension, it seems it would be closer to what some async fans wanted, and less of a kludge. It won't do away with the possibility that middleware might block waiting for input, of course, but when no middleware is present or the middleware isn't transforming the input stream, it should work out quite well.
That sounds okay. I'd specify that the on_get "length" bit is a hint, and may or may not be honored. put/finish is the right API for output (although I'd call it write/finish myself), and on_get seems like a fairly usable API for input. It doesn't let you pause the incoming data, so if you're passing it on to a slow downstream you'll potentially need to buffer a lot, but maybe that's too much to ask for. I assume callback('') is used to indicate end of incoming data: that should be specified.
However, interaction with middleware seems quite tricky here:
- For input-modifying middleware: I guess on_get would have to just raise an exception if wsgi.input has been replaced. If the input stream was iterable, an on_get callback could just be considered notice that you can iterate the input stream once without blocking, assuming the block boundary requirements were also in effect here. Then it would work right even if the input stream was replaced. However, I think it might be the case that middleware that wants to modify the input stream is so rare, it doesn't really matter.
- Output. The block boundary section implies that middleware that follows the guidelines, and doesn't do any blocking operations of its own, should work without worrying about the server and application being async or sync. If this is to work, the server cannot expect to actually receive an asyncwrapper iterable as the return value, even if the app is using it, because the middleware might be consuming that iterable and returning one of its own. This means the .put/.next methods should communicate out-of-band, effectively calling pause/resume functions in the server so it knows when it's safe to iterate the vanilla iterator the middleware returned without the middleware blocking when calling the asyncwrapper-iterator.
But if this is the overall right approach, I'd like to drop the current proposals to make 'wsgi.input' an iterator and add optional 'pause'/'resume' APIs, since they were rather kludgy compared to giving async apps their own mini-API for nonblocking I/O.
Perhaps Peter Hunt could try to implement it in his twisted wsgi gateway and see if it works out. :)
James
At 12:52 AM 10/5/04 -0400, James Y Knight wrote:
A bit late with the response...but better late than never I hope. ;)
On Sep 22, 2004, at 9:56 PM, Phillip J. Eby wrote:
On the positive side of the iterator approach, it could make it easier for asynchronous applications to pause waiting for input, and it could in principle support "chunked" transfer encoding of the input stream.
Anyway, the long and short of it is that CGI and chunked encoding are quite simply incompatible, which means that relying on its availability would be nonportable in a WSGI application anyway.
I do not find that a good reason to copy the mistake (not supporting chunking) to a new API.
Perhaps not, but there are also lots of other reasons not to support chunked input, mainly that a Google search for "chunked encoding CGI" turns up reams of vulnerabilities that suggest existing HTTP implementations may leave a bit to be desired with respect to accepting a POST of chunked input. :)
However! I don't think that the file-like-object API even has a problem with chunked incoming data. As long as WSGI does not make CONTENT_LENGTH a required header, and as long as the result of read looks different for "more data still to come" and "data finished" (it does, blocking for more data to occur vs. returning ''), I think it should be fine (for non-async apps). Am I missing something here?
I don't think so. Although you probably want something more like a pipe error if the input times out or the connection is broken.
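The point about the file-like API can be illustrated with a short sketch. It assumes, as described above, that `read()` blocks until data arrives and returns an empty string only once the body is finished, so the application never needs CONTENT_LENGTH. The helper name `read_body` and the `block_size` parameter are hypothetical, and `io.BytesIO` stands in for a real server's input stream.

```python
import io


def read_body(environ, block_size=8192):
    """Read the whole request body without consulting CONTENT_LENGTH."""
    stream = environ["wsgi.input"]
    chunks = []
    while True:
        # Assumption: read() blocks while more data is still to come and
        # returns an empty byte string only at the end of the body.
        block = stream.read(block_size)
        if not block:
            break
        chunks.append(block)
    return b"".join(chunks)


# Stand-in for a de-chunked request body supplied by the server.
fake_environ = {"wsgi.input": io.BytesIO(b"abc" * 5000)}
body = read_body(fake_environ, block_size=4096)
```

A timeout or broken connection would, per the reply above, be better reported as an exception raised from `read()` than as an empty result, since an empty result already means end of data.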
So, I'm thinking we should shift the burden to an async-specific API. But, in this case, "burden" means that we get to give asynchronous apps an API much more suited to their use cases. [...] The idea is that this would create an iterator that the server/gateway could recognize as "special", similar to the file-wrapper trick. But, the object returned would provide an extra API for use by the asynchronous application, maybe something like:
put(data) -- queue data for retrieval when the controller is iterated over
finish() -- mark the iterator finished, so it raises StopIteration
on_get(length,callback) -- call 'callback(data)' when 'length' bytes are available on 'wsgi.input' (but return immediately from the 'on_get()' call)
While this API is an optional extension, it seems it would be closer to what some async fans wanted, and less of a kludge. It won't do away with the possibility that middleware might block waiting for input, of course, but when no middleware is present or the middleware isn't transforming the input stream, it should work out quite well.
That sounds okay. I'd specify that the on_get "length" bit is a hint, and may or may not be honored. put/finish is the right API for output (although I'd call it write/finish myself),
The reason for not using 'write' is to avoid confusion with the existing "write" callable, both in terms of knowing which one we're talking about, and in terms of not confusing the semantics, which may differ subtly between the two.
and on_get seems like a fairly usable API for input. It doesn't let you pause the incoming data,
Actually it does; it's supposed to be a one-shot. You have to call it again if you want to get called back again.
so if you're passing it on to a slow downstream you'll potentially need to buffer a lot, but maybe that's too much to ask for. I assume callback('') is used to indicate end of incoming data: that should be specified.
I missed that entirely, but it sounds like a good idea.
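The one-shot behaviour and the end-of-data convention discussed here can be sketched as follows. This is an illustration only: `OneShotInput` and its `deliver()` hook are hypothetical stand-ins for whatever a server would provide, `BodyCollector` is an invented consumer, and the sketch treats 'length' purely as a hint that it ignores.

```python
class OneShotInput:
    """Hypothetical stand-in for the input half of the proposed wrapper."""

    def __init__(self):
        self._callback = None

    def on_get(self, length, callback):
        # One-shot: only a single callback is remembered, and it is
        # forgotten as soon as it has been called once.
        self._callback = callback

    def deliver(self, data):
        """Assumed server hook; returns False when no callback is armed."""
        callback, self._callback = self._callback, None
        if callback is None:
            return False  # input is effectively paused
        callback(data)
        return True


class BodyCollector:
    """Consumer that re-arms on_get() after every chunk until end of data."""

    def __init__(self, source, block_size=4096):
        self.source = source
        self.block_size = block_size
        self.chunks = []
        self.done = False
        source.on_get(block_size, self.received)

    def received(self, data):
        if data == b"":
            # Empty data marks the end of the incoming body.
            self.done = True
            return
        self.chunks.append(data)
        # Ask for the next chunk; skipping this call would pause input.
        self.source.on_get(self.block_size, self.received)


source = OneShotInput()
collector = BodyCollector(source)
```

Because the consumer decides when to call `on_get()` again, it can hold off while a slow downstream catches up, which is the pausing behaviour described above.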
However, interaction with middleware seems quite tricky here:
- For input modifying middleware: I guess on_get would have to just raise an exception if wsgi.input has been replaced.
Yep. Although it might be that the wrapper would just refuse to instantiate in the first place in that circumstance.
If the input stream was iterable, an on_get callback could just be considered notice that you can iterate the input stream once without blocking, assuming the block boundary requirements were also in effect here.
Yes, but this'd only work if the input were an iterator. input.read() returning an empty string would mean EOF, so the boundary stuff doesn't work in that case.
- Output. The block boundary section implies that middleware that follows the guidelines, and doesn't do any blocking operations of its own should work without worrying about the server and application being async or sync. If this is to work, the server cannot expect to actually receive an asyncwrapper iterable as the return value, even if the app is using it, because the middleware might be consuming that iterable and returning one of its own.
Correct.
This means the .put/.next methods should communicate out-of-band, effectively calling pause/resume functions in the server so it knows when it's safe to iterate the vanilla iterator the middleware returned without the middleware blocking when calling the asyncwrapper-iterator.
It could do that, certainly. But, the truth is it's *always* safe to iterate. Note that the application can just use the on_get callback to set a flag that it's ready to continue, and just keep yielding empty strings till then. More to the point, the iterator-wrapper can simply yield empty strings when its internal queue is empty, and a sensible async server should back off its iterator.next() retry attempts when an application yields empty strings. This is pretty much always safe and sensible. However, the out-of-band communication you describe can also take place, since it provides better communication in the case where the extension is available.
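The back-off idea can be sketched from the server's side. This is a simplified illustration, not how any particular server works: the function name `drain`, its delay parameters, and the injectable `sleep` are assumptions, and a real asynchronous server would reschedule the retry on its event loop instead of sleeping.

```python
import time


def drain(iterable, write, sleep=time.sleep, initial_delay=0.01, max_delay=0.5):
    """Send non-empty chunks; back off while the app yields empty strings."""
    delay = initial_delay
    for chunk in iterable:
        if chunk:
            write(chunk)
            delay = initial_delay  # progress was made, so reset the back-off
        else:
            # Empty chunk: the application has nothing ready yet.
            sleep(delay)
            delay = min(delay * 2, max_delay)


def app_iter():
    """Example application output with pauses expressed as empty chunks."""
    yield b""
    yield b""
    yield b"x"
    yield b""
    yield b"y"


sent = []
naps = []
# Record the requested delays instead of really sleeping.
drain(app_iter(), sent.append, sleep=naps.append)
```

The same loop works unchanged when middleware sits between the server and the application, provided the middleware passes empty chunks through, since nothing here depends on recognising the wrapper object.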
At 2004-10-05 12:52 AM -0400, you wrote:
I assume callback('') is used to indicate end of incoming data: that should be specified.
Reasonable assumption. But this is Python; why not callback(None) to indicate no more data? Semantically, None makes more sense here than an empty string. Just my $.02.
- Sam
__________________________________________________________
Spinward Stars, LLC
Samuel Reynolds
Software Consulting and Development
303-805-1446
http://SpinwardStars.com/
sam@SpinwardStars.com
participants (4)
- James Y Knight
- Phillip J. Eby
- Samuel Reynolds
- William Dode