Defers, the reactor, and idiomatic/proper usage -- new user needs some advice?

Hello all,

Just getting started with Twisted. Thanks to the community for tremendous work. I diligently went through the entire archives of the mailing list -- reading choice threads -- and I read through all the documentation on the twistedmatrix website. I've never done any event-driven programming, but there was enough on the site for me to start getting a handle on things.

Below is a short script that crawls the links from a URI looking for RSS-type feeds. I'm hoping that some of the more experienced developers would be willing to give some advice about whether I'm using twisted.internet.reactor and twisted.web.client.getPage correctly. I've put some comments labeled as # Question where I'm unsure whether I understand exactly what I'm doing -- I'm hoping someone can refute/chastise/critique the understanding that is implied by the code and questions. I have a thick skin, so the more talented/vitriolic the response the better.

    # Find RSS Feeds.
    # Richard Meraz -- rfmeraz@gmail.com

    from twisted.internet import reactor
    from twisted.web.client import getPage
    import feedlib  # Includes a modified version of M. Pilgrim's feedfinder.py
                    # and a modified version of D. Mertz's code for URL
                    # extraction from p. 228 of __TPIP__.

    MAXTIME = 60    # Kill crawl after this time
    TIMEOUT = 20    # Kill page retrieval after this time inactive
    MAXDEPTH = 3    # Recurse to this depth when crawling a page

    # Question: There seem to be many idioms for aggregating information
    # from different deferred callback chains in Twisted. Since everything
    # runs in a single thread, I just stuck my stuff in a global class and
    # everybody modifies the vars there as I pass it around to the
    # callbacks that should see it. Seems okay for a small script like
    # this?
    class StateVars:
        '''Keep global state for starting/stopping feedfinding and a
        record of the links we have checked and their status.'''
        connections = 1
        links_checked = {}  # Structure: {url: (is RSS/ATOM/RDF, page-content)}

    # Question: start_feed_crawl is where I set up my deferreds. getPage
    # returns a deferred and I attach my callback process_link.
    # addCallbacks adds a callback/errback in parallel, so only one or the
    # other is called? So I need to add the final errback to catch errors
    # from the callback process_link?
    def start_feed_crawl(uri, depth):
        '''Harvest feeds from a uri'''
        # Question: how do I time out this deferred chain if getPage is
        # taking too long to finish its work? What exactly does the
        # timeout argument to getPage do: does it time out the socket
        # after no response, or does it put an upper bound on how long
        # getPage has to finish its work?
        getPage(uri, timeout=TIMEOUT).addCallbacks(
            callback=process_link, callbackArgs=(uri, depth, StateVars),
            errback=process_error, errbackArgs=(uri, StateVars)
            ).addErrback(process_error, uri, StateVars)

    def process_link(data, uri, depth, state):
        '''Recursive link-processing callback. Determines whether a link
        is an RSS/ATOM/RDF feed. If not, extracts all xml-like links that
        could point to feeds and starts a crawl on those.'''
        if feedlib.couldBeFeedData(data):
            #print 'Feed: %s' % uri
            state.links_checked[uri] = (True, data)
        else:
            state.links_checked[uri] = (False, None)
            if depth <= MAXDEPTH:
                alinks = feedlib.getALinks(data, uri)
                links = feedlib.getLinks(data, uri)
                rawurls = feedlib.extract_urls(data)
                links_to_check = [feedlib.makeFullURI(u)
                                  for u in set(alinks + links + rawurls)
                                  if feedlib.isXMLRelatedLink(u)]
                for l in links_to_check:
                    # Don't need to see it again.
                    if state.links_checked.has_key(l):
                        continue
                    else:
                        state.connections += 1
                        # Question: since I'm starting up these deferreds
                        # in a callback, they are being created after I've
                        # called reactor.run(), because we call
                        # start_feed_crawl as we find new links that meet
                        # our criteria. Am I doing anything bad here? All
                        # the examples I've seen (e.g. pp. 548-552, Python
                        # Cookbook -- great example by V. Volonghi and P.
                        # Cogolo) have their data up front and therefore
                        # set up all the deferreds before calling
                        # reactor.run(). Here I'm discovering my data as I
                        # go along and setting up deferreds while the
                        # reactor is spinning. Here is my fundamental lack
                        # of understanding: while this script seems to run
                        # okay, is it okay to do this?
                        start_feed_crawl(l, depth + 1)
        state.connections -= 1
        # Question: Is this how I kill the reactor, i.e. using some sort
        # of state condition? Is there a better way -- should I try harder
        # to understand DeferredList? For example: a top-level
        # DeferredList that contains other DeferredLists, which get
        # created to hold all the deferreds (created by start_feed_crawl)
        # for the links on a given page. Could this DeferredList be told
        # to stop the reactor when the other lists have fired their
        # callbacks (after the component deferreds have finished)? (Sorry
        # for the convoluted question here; I'm new at this.)
        if state.connections <= 0:
            reactor.stop()
        return

    def process_error(error, uri, state):
        '''Catch errors in link processing'''
        state.connections -= 1
        if state.connections <= 0:
            reactor.stop()
        return ''

    if __name__ == '__main__':
        import sys
        if len(sys.argv) < 2:
            print 'feedfinder_new.py <uri>'
            sys.exit()
        uri = feedlib.makeFullURI(sys.argv[1])
        start_feed_crawl(uri, 1)
        # Question: I'm killing the process after a predetermined amount
        # of time. However, reactor.stop() seems to kill network
        # connections. Is there a way to stop the reactor but let the
        # connections finish?
        # Hack to blow out any connections that are hung or uncalled
        # after MAXTIME.
        reactor.callLater(MAXTIME, reactor.stop)
        reactor.run()
        for l in StateVars.links_checked:
            if StateVars.links_checked[l][0]:
                print l

Final question: occasionally I get errors that come from the http.py code in Twisted. These get printed to the console, but don't necessarily stop my program. Should my errbacks be catching these? How do I keep errors from getting logged to the console (besides redirecting stderr)? I can post an example of the errors I'm getting if necessary.

Thanks for your help.

Richard F. Meraz

-- 
Never think there is anything impossible for the soul. It is the greatest heresy to think so. If there is sin, this is the only sin -- to say that you are weak, or others are weak. -- Swami Vivekananda

I'm not familiar with feedlib, etc., but I'll answer what I can.

Richard Meraz wrote:
> MAXTIME = 60    # Kill crawl after this time
> TIMEOUT = 20    # Kill page retrieval after this time inactive
> MAXDEPTH = 3    # Recurse to this depth when crawling a page
> # Question: There seem to be many idioms for aggregating information
> # from different deferred callback chains in Twisted. Since everything
> # runs in a single thread, I just stuck my stuff in a global class and
> # everybody modifies the vars there as I pass it around to the
> # callbacks that should see it. Seems okay for a small script like
> # this?
That seems fine, yeah. I think I would pass around the StateVars instance as a context if I were coding this. Probably the same effect.
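For example, a rough sketch along the lines of the original script (CrawlState is a made-up name; the instance simply replaces the class-level globals):

    class CrawlState:
        '''Same data as StateVars, but instantiated and passed explicitly.'''
        def __init__(self):
            self.connections = 1
            self.links_checked = {}

    def start_feed_crawl(uri, depth, state):
        '''Harvest feeds from a uri, recording progress on state.'''
        d = getPage(uri, timeout=TIMEOUT)
        d.addCallback(process_link, uri, depth, state)
        d.addErrback(process_error, uri, state)

    # At startup:
    #   state = CrawlState()
    #   start_feed_crawl(uri, 1, state)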
> class StateVars:
>     '''Keep global state for starting/stopping feedfinding and a
>     record of the links we have checked and their status.'''
>     connections = 1
>     links_checked = {}  # Structure: {url: (is RSS/ATOM/RDF, page-content)}
> # Question: start_feed_crawl is where I set up my deferreds. getPage
> # returns a deferred and I attach my callback process_link.
> # addCallbacks adds a callback/errback in parallel, so only one or the
> # other is called? So I need to add the final errback to catch errors
> # from the callback process_link?
Correct. Well, sort of. See below.
> def start_feed_crawl(uri, depth):
>     '''Harvest feeds from a uri'''
>     # Question: how do I time out this deferred chain if getPage is
>     # taking too long to finish its work? What exactly does the
>     # timeout argument to getPage do: does it time out the socket
>     # after no response, or does it put an upper bound on how long
>     # getPage has to finish its work?
>     getPage(uri, timeout=TIMEOUT).addCallbacks(
>         callback=process_link, callbackArgs=(uri, depth, StateVars),
>         errback=process_error, errbackArgs=(uri, StateVars)
>         ).addErrback(process_error, uri, StateVars)
It seems clearer to me to write this as follows, but that's personal preference:

    d = getPage(...)
    d.addCallbacks(...)
    d.addErrback(...)

But since you're setting up the call to the same errback twice, you could simplify this to:

    d = getPage(...)
    d.addCallback(process_link, uri, depth, StateVars)
    d.addErrback(process_error, uri, StateVars)

<http://twistedmatrix.com/projects/core/documentation/howto/defer.html#auto4> has a nice visual explanation of what happens when.
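To make the "one or the other" behaviour concrete, here is a small self-contained sketch (synchronous, nothing to do with getPage): the errback in an addCallbacks pair never sees an error raised by its paired callback, while a later addErrback does.

    from twisted.internet import defer

    def ok(result):
        raise RuntimeError('boom')          # error raised *inside* the callback

    def pair_err(failure):
        # Paired with ok via addCallbacks: it handles failures from
        # *earlier* in the chain, so it never sees ok's RuntimeError.
        print 'pair errback (never reached here)'
        return failure

    def chained_err(failure):
        failure.trap(RuntimeError)          # catches what ok raised
        print 'chained errback caught:', failure.getErrorMessage()

    d = defer.succeed('page data')          # stands in for getPage's result
    d.addCallbacks(ok, pair_err)            # one *or* the other runs, never both
    d.addErrback(chained_err)               # a later link catches ok's error

Running this prints "chained errback caught: boom"; pair_err is never called.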
> # Question: since I'm starting up these deferreds in a callback, they
> # are being created after I've called reactor.run(), because we call
> # start_feed_crawl as we find new links that meet our criteria. Am I
> # doing anything bad here? All the examples I've seen (e.g. pp.
> # 548-552, Python Cookbook -- great example by V. Volonghi and P.
> # Cogolo) have their data up front and therefore set up all the
> # deferreds before calling reactor.run(). Here I'm discovering my data
> # as I go along and setting up deferreds while the reactor is
> # spinning. Here is my fundamental lack of understanding: while this
> # script seems to run okay, is it okay to do this?
Yes, that's fine. If anything, the examples you've seen are the odd case: it's unusual to be able to set up all the Deferreds before the reactor starts.
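A trivial demonstration that nothing special happens -- a toy sketch in which a Deferred is created long after the reactor has started:

    from twisted.internet import defer, reactor

    def discovered_late():
        # A brand-new Deferred, created well after reactor.run().
        d = defer.Deferred()
        d.addCallback(lambda result: reactor.stop())
        reactor.callLater(1, d.callback, 'done')

    reactor.callLater(1, discovered_late)
    reactor.run()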
> # Question: Is this how I kill the reactor, i.e. using some sort of
> # state condition? Is there a better way -- should I try harder to
> # understand DeferredList? For example: a top-level DeferredList that
> # contains other DeferredLists, which get created to hold all the
> # deferreds (created by start_feed_crawl) for the links on a given
> # page. Could this DeferredList be told to stop the reactor when the
> # other lists have fired their callbacks (after the component
> # deferreds have finished)? (Sorry for the convoluted question here;
> # I'm new at this.)
What you want to do is stop the reactor when everything is done processing. So after you call start_feed_crawl the first time, returning the Deferred that getPage gives you, you can add a callback to that which stops the reactor.

The trick here is that if you stuff that Deferred into a DeferredList before you add the callback that stops the reactor, and your first operation itself returns a Deferred, then the DeferredList won't call its callbacks until that other Deferred completes too. So you'll be stacking up a whole bunch of Deferreds inside the first one, and the callback on the DeferredList that does the reactor.stop won't fire until a callback finally doesn't return a Deferred.

There might be an easier way to do this, but this is the way I know (example attached). Someone please let me know if there's an easier way.

To see the example, run it with 'twistd -noy fetchpage.tac', then 'telnet localhost 9000' and send:

    GET /?target=http://www.google.com/ HTTP/1.1
    Host: localhost
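Distilled down, the shape of the attached example is roughly this (a bare standalone sketch rather than the .tac file):

    from twisted.internet import defer, reactor
    from twisted.web.client import getPage
    from twisted.python import log

    def fetched(html):
        # If this callback returned another Deferred, the DeferredList
        # below would also wait for that one before firing.
        print 'got %d bytes' % len(html)

    d = getPage('http://www.google.com/')
    d.addCallback(fetched)
    d.addErrback(log.err)

    # The DeferredList adds its own callback *after* the ones above, so
    # it fires once d's whole chain has finished.
    dl = defer.DeferredList([d])
    dl.addCallback(lambda result: reactor.stop())
    reactor.run()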
> Final question: occasionally I get errors that come from the http.py
> code in Twisted. These get printed to the console, but don't
> necessarily stop my program. Should my errbacks be catching these? How
> do I keep errors from getting logged to the console (besides
> redirecting stderr)?
When you create the DeferredList, pass in consumeErrors=1 -- this will make debugging that much more annoying, though...

HTH,
Dave

    from twisted.web import server
    from twisted.web.resource import Resource
    from twisted.web.client import getPage
    from twisted.internet import defer, reactor
    from twisted.python import log
    from cgi import escape

    class Foo(Resource):
        counter = 0
        isLeaf = True

        def render_GET(self, request):
            self.rq = request
            target = escape(request.args['target'][0])
            d = getPage(target).addCallback(self.print_page)
            d.addErrback(log.err)
            dl = defer.DeferredList([d])
            dl.addCallback(stopNow)
            dl.addErrback(log.err)
            return server.NOT_DONE_YET

        def print_page(self, html):
            if Foo.counter < 5:
                Foo.counter += 1
                print 'request ' + str(Foo.counter)
                d = defer.Deferred()
                d.addCallback(self.print_page)
                d.addErrback(log.err)
                reactor.callLater(1, d.callback, html)
                return d
            else:
                print 'now we can write stuff back'
                self.rq.write(str(len(html)) + ' ' + str(Foo.counter))
                self.rq.finish()
                self.rq.transport.loseConnection()
                # no Deferred being returned, so stopNow fires

    def stopNow(cbval):
        # can't add reactor.stop as a callback directly because it
        # doesn't know what to do with the extra argument passed in from
        # the preceding callback
        print cbval
        reactor.stop()

    resource = Foo()
    site = server.Site(resource)

    from twisted.application import service, internet
    application = service.Application("Foo")
    internet.TCPServer(9000, site).setServiceParent(application)

    # vim: ai sts=4 sw=4 expandtab syntax=python :
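For the original script, which has no DeferredList, the equivalent is a final errback that handles the Failure instead of re-raising it -- a minimal sketch of both options:

    from twisted.internet import defer

    # Option 1: let a DeferredList swallow the failures of its children:
    #   dl = defer.DeferredList(deferreds, consumeErrors=1)

    # Option 2: give each Deferred a final errback. An errback that
    # returns a non-Failure consumes the error, so it is never logged to
    # the console as unhandled when the Deferred is garbage-collected.
    def squash(failure):
        return None

    # d stands for any of the getPage Deferreds in the original script:
    d = defer.fail(RuntimeError('simulated http.py error'))
    d.addErrback(squash)                    # nothing is printed or logged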

> which stops the reactor. The trick here is that if you stuff that
> Deferred into a DeferredList before you add the callback that stops
> the reactor, and your first operation itself returns a Deferred, then
> the DeferredList won't call its callbacks until that other Deferred
> completes too. So you'll be stacking up a whole bunch of Deferreds
> inside the first one, and the callback on the DeferredList that does
> the reactor.stop won't fire until a callback finally doesn't return a
> Deferred.
>
> There might be an easier way to do this, but this is the way I know
> (example attached). Someone please let me know if there's an easier
> way.

[snip]

    def render_GET(self, request):
        self.rq = request
        target = escape(request.args['target'][0])
        d = getPage(target).addCallback(self.print_page)
        d.addCallback(stopNow)
        d.addErrback(log.err)
        return server.NOT_DONE_YET
I'm gonna go ahead and answer my own question here: this isn't restricted to DeferredLists; it applies to regular Deferreds too.
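That is, a callback on a plain Deferred that returns another Deferred pauses the rest of the chain until the inner one fires -- a toy sketch:

    from twisted.internet import defer, reactor

    def step_one(result):
        # Returning a Deferred here pauses the outer chain until it fires.
        inner = defer.Deferred()
        reactor.callLater(1, inner.callback, result + ' -> step one')
        return inner

    def step_two(result):
        print result                        # runs only after inner has fired
        reactor.stop()

    d = defer.Deferred()
    d.addCallback(step_one)
    d.addCallback(step_two)
    reactor.callLater(0, d.callback, 'start')
    reactor.run()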