Defers, the reactor, and idiomatic/proper usage -- new user needs some advice?
Hello all,

Just getting started with Twisted. Thanks to the community for tremendous work. I diligently went through the entire archives of the mailing list -- reading choice threads -- and I read through all the documentation on the twisted-matrix website. I've never done any event-driven programming, but there was enough on the site for me to start getting a handle on things.

Below is a short script that crawls the links from a URI looking for RSS-type feeds. I'm hoping that some of the more experienced developers would be willing to give some advice about whether I'm using twisted.internet.reactor and twisted.web.client.getPage correctly. I've put some comments labeled as #Question where I'm unsure whether I understand exactly what I'm doing -- I'm hoping someone can refute/chastise/critique the understanding that is implied by the code and questions. I have a thick skin so the more talented/vitriolic the response the better.

# Find RSS Feeds.
# Richard Meraz -- rfmeraz@gmail.com.

from twisted.internet import reactor
from twisted.web.client import getPage
import feedlib
# Includes modified version of M. Pilgrim's feedfinder.py and modified version of
# D. Mertz's code for url extraction from p. 228 of __TPIP__.

MAXTIME = 60   # Kill crawl after this time
TIMEOUT = 20   # Kill page retrieval after this time inactive
MAXDEPTH = 3   # Recurse this depth when crawling page.

# Question: There seem to be many idioms to aggregate information from different
# deferred call-back chains in twisted. Since everything runs in a single thread I
# just stuck my stuff in a global class and everybody modifies the vars there as I
# pass it around to the call-backs that should see it. Seems okay for a small
# script like this?
class StateVars:
    '''Keep Global state for starting/stopping feedfinding and a record of
    links we have checked and their status'''
    connections = 1
    links_checked = {}  # Structure: {url: (is RSS/ATOM/RDF, page-content)}

# Question: start_feed_crawl is where I set up my defers.
# getPage returns a defer and I attach my call-back process_link.
# addCallbacks adds a callback/errback in parallel so only one or the other is called? so I need to add
# the final errback to catch errors from callback process_link ?
def start_feed_crawl(uri,depth):
    '''Harvest feeds from a uri'''
    # Question: how to time-out this deferred chain if getPage is taking too long to finish its work.
    # what exactly does the argument timeout to getPage do, does it timeout the socket after a no-response
    # or does it put an upper-bound on how long getPage has to finish its work?
    getPage(uri, timeout=TIMEOUT).addCallbacks(
        callback=process_link, callbackArgs=(uri, depth, StateVars),
        errback=process_error, errbackArgs=(uri, StateVars)
    ).addErrback(process_error, uri, StateVars)

def process_link(data,uri,depth,state):
    '''Recursive link processing callback. Determines whether a link is a RSS/ATOM/RDF feed.
    If not then extracts all xml-like links that could point to feeds and starts crawl on those.'''
    if feedlib.couldBeFeedData(data):
        #print 'Feed: %s' % uri
        state.links_checked[uri] = (True,data)
    else:
        state.links_checked[uri] = (False,None)
        if depth <= MAXDEPTH:
            alinks = feedlib.getALinks(data,uri)
            links = feedlib.getLinks(data,uri)
            rawurls = feedlib.extract_urls(data)
            links_to_check = [feedlib.makeFullURI(u) for u in set(alinks+links+rawurls)
                              if feedlib.isXMLRelatedLink(u)]
            for l in links_to_check:
                # Don't need to see it again.
                if state.links_checked.has_key(l):
                    continue
                else:
                    state.connections += 1
                    # Question: since I'm starting up these defers in a callback they are
                    # being created after I've called reactor.run() since we call start_feed_crawl
                    # as we find new links that meet our criteria. Am I doing anything bad here?
                    # All the examples I've seen (eg. p. 548-552 Python Cookbook, great eg by V. Volonghi
                    # and P. Cogolo) have their data up-front and therefore set-up all the defers before calling
                    # reactor.run().
                    # Here I'm discovering my data as I go along and setting up deferreds while
                    # the reactor is spinning. Here is my fundamental lack of understanding. While this script
                    # seems to run okay, is it okay to do this?
                    start_feed_crawl(l,depth+1)
    state.connections -= 1
    # Question: Is this how I kill the reactor -- ie. using some sort of state condition. Is there a better way,
    # should I try better to understand deferred-list. For example. A top-level deferred-list that contains
    # other deferred-lists which get created to hold all the defers (created by start_feed_crawl) for the
    # links on a given page. Could this deferred-list be told to stop the reactor when the other lists have
    # fired their callback (after the component defers have finished) ? (Sorry for the convoluted question here
    # I'm new at this)
    if state.connections <= 0:
        reactor.stop()
    return

def process_error(error,uri,state):
    '''Catch errors in link processing'''
    state.connections -= 1
    if state.connections <= 0:
        reactor.stop()
    return ''

if __name__ == '__main__':
    import sys
    if len(sys.argv) < 2:
        print 'feedfinder_new.py <uri>'
        sys.exit()
    uri = feedlib.makeFullURI(sys.argv[1])
    start_feed_crawl(uri,1)
    # Question: I'm killing the process after a pre-determined amount of time. However reactor.stop() seems
    # to kill network connections. Is there a way to stop the reactor but let the connections finish?
    # Hack to blow out any connections that are hung or uncalled after MAXTIME
    reactor.callLater(MAXTIME, reactor.stop)
    reactor.run()
    for l in StateVars.links_checked:
        if StateVars.links_checked[l][0]:
            print l

Final question: occasionally I get errors that come from the http.py code in twisted. These get printed to the console, but don't necessarily stop my program. Should my errbacks be catching these? How do I keep errors from getting logged to the console (besides redirecting stderr)? I can post an example of the errors I'm getting if necessary.

Thanks for your help.

Richard F. Meraz

--
Never think there is anything impossible for the soul. It is the greatest heresy to think so. If there is sin, this is the only sin – to say that you are weak, or others are weak.
Swami Vivekananda