[Twisted-Python] Scheduling of modules
I've more or less had Twisted sold to me as the be-all and end-all in terms of avoiding having to deal with threads, amongst other things, so I thought I'd ask whether Twisted is the right thing to use for my problem.
I have a number of modules which use John J Lee's mechanize module to scrape different web sites. Right now mechanize uses standard urllib calls to do its thing, though it may be possible, after much refactoring, to get it to use something more asynchronous. So for short-term rapid development, I'm more or less stuck with threads if I want these modules to run at all concurrently.
Specifically, I'm going to need to poll a database every minute or so (or write some form of SQL Server 2000 trigger which will call me when things change) and if there are any changes in the database, potentially fire off thirty of these web scraping modules at once. The modules are self-contained, and don't communicate with anything other than the thing that calls them by returning values, so I'm vaguely certain they're more or less thread-safe, inasmuch as I ever can be.
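To make the polling half concrete, here is a minimal sketch of the loop I have in mind, in plain Python with no Twisted in it. The names poll_forever, fetch_changes and handle_change are placeholders I made up for illustration; fetch_changes stands in for whatever query ends up running against the database.

```python
import threading


def poll_forever(fetch_changes, handle_change, interval, stop):
    """Call fetch_changes() every `interval` seconds until `stop` is set.

    fetch_changes and handle_change are hypothetical callables supplied by
    the caller: the first returns an iterable of pending changes, the second
    is invoked once per change.
    """
    while not stop.is_set():
        for change in fetch_changes():
            handle_change(change)
        # Sleep for the interval, but wake immediately if asked to stop.
        stop.wait(interval)
```

In the real thing interval would be around 60, and the service controller's shutdown request would call stop.set() on the threading.Event.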
My beef is that I can't have more than one scraping module modifying the same site concurrently. This introduces race conditions on the site, rather than in my code. Each module touches only one site, so I need to basically have a module-level lock either in the module or in the thread scheduler to ensure that I'm not running the same module more than once.
This makes me think of some sort of queue structure. I either need to have one queue that just works through its requests ignoring any that are currently running, or one queue per module with some sort of central dispatcher that will place a request in the appropriate queue.
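The second option, one queue per module with a central dispatcher, could be sketched roughly as below using only the standard library. ModuleDispatcher and its methods are names I've invented for this example, not anything from Twisted or mechanize, and I'm assuming one worker thread per module is acceptable for thirty or so modules.

```python
import queue
import threading


class ModuleDispatcher:
    """Hypothetical dispatcher: one FIFO queue and one worker thread per module."""

    def __init__(self):
        self._queues = {}
        self._threads = []
        self._lock = threading.Lock()

    def submit(self, name, func, *args):
        """Queue func(*args) behind any earlier request for module `name`.

        Returns (result, done): `done` is an Event set when the call has
        finished, and `result` then holds either "value" or "error".
        """
        result = {}
        done = threading.Event()
        with self._lock:
            q = self._queues.get(name)
            if q is None:
                # First request for this module: create its queue and worker.
                q = queue.Queue()
                self._queues[name] = q
                t = threading.Thread(target=self._worker, args=(q,), daemon=True)
                self._threads.append(t)
                t.start()
            q.put((func, args, result, done))
        return result, done

    def _worker(self, q):
        # A single thread drains each queue, so a given module never runs
        # concurrently with itself, while different modules run in parallel.
        while True:
            item = q.get()
            if item is None:
                break
            func, args, result, done = item
            try:
                result["value"] = func(*args)
            except Exception as exc:
                result["error"] = exc
            finally:
                done.set()

    def shutdown(self):
        """Let every queue finish its pending work, then stop the workers."""
        with self._lock:
            for q in self._queues.values():
                q.put(None)
        for t in self._threads:
            t.join()
```

A second request for a module that is still running simply waits in that module's queue and starts as soon as the current run completes, which is the behaviour I'm after.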
In real terms, these modules may take up to three minutes to complete their scraping, though most take 20 seconds or so. I'd rather not call them one after the other in a blocking manner: I'd like a five or six minute response time whenever a request is placed in the database to fire off a bunch of updates, rather than the close to 20 minutes I currently get when I fire off a complete unittest suite. These requests may come in several times a day, most commonly hours apart, but I need to be able to react if I get two or three different requests within a five minute period, which would mean firing off the next request to a module as soon as it has completed the current one.
Is this something Twisted can help me with? If so, what are my options within Twisted, and what should I be reading up on how to use? I have a vague idea that Twisted has a thread pool, but I'm not sure if it has an event queue that would be suitable for this sort of control, or how I'd go about modifying whatever's there to be useful for this sort of thing.
If not, any pointers to patterns that might help me code such a thing up?
I'm running on Windows. I need no GUI integration as such, though it will need to run as an NT service. Any input other than through the database could potentially be triggered by a client programme going through something like Perspective Broker, or by the Windows NT service controller telling me to start up or shut down.
Thanks for your help,