On Thu, 18 Oct 2007 14:41:38 +0200, Jürgen Strass wrote:
Hello,
I'm rather new to twisted and asynchronous programming in general. Overall, I think I've understood the asynchronous programming model and its implications quite well. Nevertheless, there are some remaining questions.
To give some example, I'd like to develop my own simplified document format in XML and a corresponding parser. The output of the parser (a specialized document object model) will be traversed and translated into HTML afterwards. This module could be useful outside any twisted application, of course. Instead of generating HTML one could develop a generator that produces LaTeX, for example. But it could also be used to render HTML pages in a twisted web application.
Have you seen Lore?
The question is this: since parsing and generating large documents could block the reactor in a twisted app, should I use any of twisted's asynchronous programming features in this module (for better integration with twisted) or should I rather develop it in a traditional way and run it in a thread?
Incremental parsing is often useful and simpler than the alternative. If you are accepting a document over the network, why buffer it yourself and then parse it when you could just be giving each piece directly to the parser? Done this way, it often is the case that even large documents can be parsed without blocking for an unreasonable amount of time.
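To make the feed-style idea concrete, here is a minimal sketch of an incremental XML parser built on the standard library's expat bindings. The DocumentParser class and its feed/close methods are illustrative names, not part of any existing API; a real parser would build a document object model rather than just record element names. The point is only that each network chunk can be handed to the parser as it arrives, with no buffering of the whole document:

```python
# Minimal sketch: incremental parsing with the stdlib expat parser.
# Each chunk is handed to Parse() as it arrives; parser state is kept
# between calls, so chunk boundaries can fall anywhere.
import xml.parsers.expat

class DocumentParser:
    """Records element names as they are parsed; a real implementation
    would build a document object model instead."""
    def __init__(self):
        self.elements = []
        self._parser = xml.parsers.expat.ParserCreate()
        self._parser.StartElementHandler = self._start

    def _start(self, name, attrs):
        self.elements.append(name)

    def feed(self, data):
        # Parse as much of `data` as possible; incomplete tokens are
        # buffered internally until the next call.
        self._parser.Parse(data, False)

    def close(self):
        # Signal end of input so truncated documents raise an error.
        self._parser.Parse(b"", True)

# Feed the document in arbitrary small pieces, as dataReceived might.
parser = DocumentParser()
for chunk in [b"<doc><ti", b"tle>Hi</title", b"></doc>"]:
    parser.feed(chunk)
parser.close()
print(parser.elements)  # ['doc', 'title']
```

In a Twisted protocol, dataReceived would simply call feed with whatever bytes arrived, spreading the parsing work across the connection's lifetime instead of doing it all at once at the end.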
The question came to my mind, because somewhere I read that long lasting operations in third party modules should be called in a thread. This is clear. I also read that if one has the opportunity to develop an application from scratch, one should rather go for using twisted's asynchronous programming features and divide long lasting operations into small chunks.
The CPU differs from the network. There are rarely points in a CPU-bound task where suspending to work on something else would not be an arbitrary decision. When dealing with the network, these points are obvious and not at all arbitrary. So, when dealing with the network, it's almost unarguable that you should use Twisted's APIs instead of using blocking APIs. However, Twisted doesn't provide any functionality specifically for breaking up CPU-bound tasks, primarily because any such functionality would be arbitrary.
In principle, this approach is clear to me, but does it also apply to modules which are entirely independent of twisted networking code? And if so, is there any way to decouple them from the twisted library for reuse in other applications?
It's typically trivial to drive code written to be used asynchronously in a synchronous manner. The opposite is rarely, if ever, true. Consider a parser API which consists of a "feed" method taking a string giving some more bytes from the input document. You can use this by passing in small chunks repeatedly until the entire document has been passed in, or you can pass in the entire document at once. Now consider an API where the entire document must be supplied at once: how do you use that without blocking?
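The asymmetry is easy to demonstrate. Below is a toy feed-style parser (the FeedParser name and its behavior are invented for illustration, it just accumulates bytes): the same object serves a synchronous caller, who passes the whole document in one call, and an asynchronous one, who passes it in pieces. A whole-document-only API offers no equivalent of the second usage:

```python
# Illustrative feed-style API: same shape as the parser sketch above,
# usable both synchronously and in small chunks.
class FeedParser:
    def __init__(self):
        self._pieces = []

    def feed(self, data):
        self._pieces.append(data)

    def close(self):
        # Stand-in for "finish parsing and return the result".
        return b"".join(self._pieces)

document = b"<doc>the entire document</doc>"

# Synchronous use: trivial, just pass everything at once.
whole = FeedParser()
whole.feed(document)
assert whole.close() == document

# Asynchronous-style use: pass small chunks as they become available.
chunked = FeedParser()
for i in range(0, len(document), 5):
    chunked.feed(document[i:i + 5])
assert chunked.close() == document
```

Going the other way, wrapping a parse-everything-at-once API so it cooperates with an event loop, is not possible without threads or processes, which is exactly Jean-Paul's point.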
The last question is what criteria I could use to divide long lasting operations into chunks. In almost all books about asynchronous programming I only read that if they're too big, they could block the event loop. Of course, but how big is too big? And what's the measure for it? Milliseconds, number of operations, number of code lines - or what? Doesn't it depend entirely on the application at hand and how reactive it has to be?
Yes.
Moreover, it depends on the hardware used: in the same amount of time, a Pentium II can process fewer chunks than an Athlon 64, for example.
True as well. However, is your primary goal to provide ideal scheduling behavior both on a CPU released this year and a CPU released ten years ago?
And couldn't chunks also be too small, so that more time than necessary is spent putting them into the reactor's queue, perhaps sorting them, and then calling them? If the overhead involved in scheduling a chunk exceeds the processing time of the chunk itself, the chunks are too small, aren't they?
Correct again. These problems can all be mitigated, at least partially, by letting the application decide how much work is done at a time. Parsing one byte from an input document should take less time than parsing one megabyte. Size of input is only one way in which this can be controlled. You could support explicit tuning of these parameters with a dedicated API, or you could support stepwise processing and let the application explicitly step it as far as it wants to at a time. In this direction, there are some extremely primitive tools in twisted.internet.task. They will not solve the problem for you, but they may give you some ideas or save you a bit of typing.
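A pure-Python sketch of the stepwise idea, in the spirit of twisted.internet.task.coiterate (the parse_in_steps function and chunk_size parameter are invented for illustration): the work is written as a generator that yields after each slice, and the caller, here a plain loop standing in for the reactor, decides when to resume it. chunk_size is the knob the application tunes against the scheduling overhead discussed above:

```python
# Stepwise processing sketch: a generator does a bounded slice of work
# per step and yields, so an event loop can interleave other events.
def parse_in_steps(tokens, result, chunk_size):
    for i in range(0, len(tokens), chunk_size):
        # Stand-in for real parsing work on one slice of the input.
        result.extend(t.upper() for t in tokens[i:i + chunk_size])
        yield  # hand control back; the reactor would run other work here

tokens = ["alpha", "beta", "gamma", "delta", "epsilon"]
result = []
for _ in parse_in_steps(tokens, result, chunk_size=2):
    pass  # a reactor would resume the generator between other events
print(result)  # ['ALPHA', 'BETA', 'GAMMA', 'DELTA', 'EPSILON']
```

Handing such a generator to twisted.internet.task.coiterate would let the reactor drive it between I/O events; making chunk_size an argument keeps the Pentium II versus Athlon 64 trade-off in the application's hands rather than hard-coding it in the parser.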
Thanks in advance for any answers, Jürgen
Jean-Paul