Defers, the reactor, and idiomatic/proper usage -- new user needs some advice?

Hello all,

Just getting started with Twisted. Thanks to the community for tremendous work. I diligently went through the entire archives of the mailing list -- reading choice threads -- and I read through all the documentation on the twistedmatrix website. I've never done any event-driven programming, but there was enough on the site for me to start getting a handle on things.

Below is a short script that crawls the links from a URI looking for RSS-type feeds. I'm hoping that some of the more experienced developers would be willing to give some advice about whether I'm using twisted.internet.reactor and twisted.web.client.getPage correctly. I've put some comments labeled as # Question where I'm unsure whether I understand exactly what I'm doing -- I'm hoping someone can refute/chastise/critique the understanding that is implied by the code and questions. I have a thick skin, so the more talented/vitriolic the response the better.

    # Find RSS Feeds.
    # Richard Meraz -- rfmeraz@gmail.com

    from twisted.internet import reactor
    from twisted.web.client import getPage
    import feedlib  # Includes a modified version of M. Pilgrim's feedfinder.py
                    # and a modified version of D. Mertz's code for URL
                    # extraction from p. 228 of __TPIP__.

    MAXTIME = 60    # Kill crawl after this time
    TIMEOUT = 20    # Kill page retrieval after this time inactive
    MAXDEPTH = 3    # Recurse to this depth when crawling a page

    # Question: There seem to be many idioms for aggregating information
    # from different deferred callback chains in Twisted. Since everything
    # runs in a single thread, I just stuck my stuff in a global class and
    # everybody modifies the vars there as I pass it around to the
    # callbacks that should see it. Seems okay for a small script like
    # this?
    class StateVars:
        '''Keep global state for starting/stopping feedfinding and a
        record of the links we have checked and their status.'''
        connections = 1
        links_checked = {}  # Structure: {url: (is RSS/ATOM/RDF, page-content)}

    # Question: start_feed_crawl is where I set up my deferreds. getPage
    # returns a deferred and I attach my callback process_link.
    # addCallbacks adds a callback/errback in parallel, so only one or the
    # other is called? So I need to add the final errback to catch errors
    # from the callback process_link?
    def start_feed_crawl(uri, depth):
        '''Harvest feeds from a uri'''
        # Question: how do I time out this deferred chain if getPage is
        # taking too long to finish its work? What exactly does the
        # timeout argument to getPage do: does it time out the socket
        # after no response, or does it put an upper bound on how long
        # getPage has to finish its work?
        getPage(uri, timeout=TIMEOUT).addCallbacks(
            callback=process_link, callbackArgs=(uri, depth, StateVars),
            errback=process_error, errbackArgs=(uri, StateVars)
            ).addErrback(process_error, uri, StateVars)

    def process_link(data, uri, depth, state):
        '''Recursive link-processing callback. Determines whether a link
        is an RSS/ATOM/RDF feed. If not, extracts all xml-like links that
        could point to feeds and starts a crawl on those.'''
        if feedlib.couldBeFeedData(data):
            #print 'Feed: %s' % uri
            state.links_checked[uri] = (True, data)
        else:
            state.links_checked[uri] = (False, None)
            if depth <= MAXDEPTH:
                alinks = feedlib.getALinks(data, uri)
                links = feedlib.getLinks(data, uri)
                rawurls = feedlib.extract_urls(data)
                links_to_check = [feedlib.makeFullURI(u)
                                  for u in set(alinks + links + rawurls)
                                  if feedlib.isXMLRelatedLink(u)]
                for l in links_to_check:
                    # Don't need to see it again.
                    if state.links_checked.has_key(l):
                        continue
                    else:
                        state.connections += 1
                        # Question: since I'm starting up these deferreds
                        # in a callback, they are being created after I've
                        # called reactor.run(), because we call
                        # start_feed_crawl as we find new links that meet
                        # our criteria. Am I doing anything bad here? All
                        # the examples I've seen (e.g. pp. 548-552, Python
                        # Cookbook -- great example by V. Volonghi and P.
                        # Cogolo) have their data up front and therefore
                        # set up all the deferreds before calling
                        # reactor.run(). Here I'm discovering my data as I
                        # go along and setting up deferreds while the
                        # reactor is spinning. Here is my fundamental lack
                        # of understanding: while this script seems to run
                        # okay, is it okay to do this?
                        start_feed_crawl(l, depth + 1)
        state.connections -= 1
        # Question: Is this how I kill the reactor, i.e. using some sort
        # of state condition? Is there a better way -- should I try harder
        # to understand DeferredList? For example: a top-level
        # DeferredList that contains other DeferredLists, which get
        # created to hold all the deferreds (created by start_feed_crawl)
        # for the links on a given page. Could this DeferredList be told
        # to stop the reactor when the other lists have fired their
        # callbacks (after the component deferreds have finished)? (Sorry
        # for the convoluted question here; I'm new at this.)
        if state.connections <= 0:
            reactor.stop()
        return

    def process_error(error, uri, state):
        '''Catch errors in link processing'''
        state.connections -= 1
        if state.connections <= 0:
            reactor.stop()
        return ''

    if __name__ == '__main__':
        import sys
        if len(sys.argv) < 2:
            print 'feedfinder_new.py <uri>'
            sys.exit()
        uri = feedlib.makeFullURI(sys.argv[1])
        start_feed_crawl(uri, 1)
        # Question: I'm killing the process after a predetermined amount
        # of time. However, reactor.stop() seems to kill network
        # connections. Is there a way to stop the reactor but let the
        # connections finish?
        # Hack to blow out any connections that are hung or uncalled
        # after MAXTIME.
        reactor.callLater(MAXTIME, reactor.stop)
        reactor.run()
        for l in StateVars.links_checked:
            if StateVars.links_checked[l][0]:
                print l

Final question: occasionally I get errors that come from the http.py code in Twisted. These get printed to the console, but don't necessarily stop my program. Should my errbacks be catching these? How do I keep errors from getting logged to the console (besides redirecting stderr)? I can post an example of the errors I'm getting if necessary.

Thanks for your help.

Richard F. Meraz

-- 
Never think there is anything impossible for the soul. It is the greatest heresy to think so. If there is sin, this is the only sin -- to say that you are weak, or others are weak. -- Swami Vivekananda

I'm not familiar with feedlib, etc., but I'll answer what I can.

Richard Meraz wrote:
> MAXTIME = 60    # Kill crawl after this time
> TIMEOUT = 20    # Kill page retrieval after this time inactive
> MAXDEPTH = 3    # Recurse to this depth when crawling a page
> # Question: There seem to be many idioms for aggregating information
> # from different deferred callback chains in Twisted. Since everything
> # runs in a single thread, I just stuck my stuff in a global class and
> # everybody modifies the vars there as I pass it around to the
> # callbacks that should see it. Seems okay for a small script like
> # this?
That seems fine, yeah. I think I would pass around the StateVars instance as a context if I were coding this. Probably the same effect.
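For example, a rough sketch along the lines of the original script (CrawlState is a made-up name; the instance simply replaces the class-level globals):

    class CrawlState:
        '''Same data as StateVars, but instantiated and passed explicitly.'''
        def __init__(self):
            self.connections = 1
            self.links_checked = {}

    def start_feed_crawl(uri, depth, state):
        '''Harvest feeds from a uri, recording progress on state.'''
        d = getPage(uri, timeout=TIMEOUT)
        d.addCallback(process_link, uri, depth, state)
        d.addErrback(process_error, uri, state)

    # At startup:
    #   state = CrawlState()
    #   start_feed_crawl(uri, 1, state)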
> class StateVars:
>     '''Keep global state for starting/stopping feedfinding and a
>     record of the links we have checked and their status.'''
>     connections = 1
>     links_checked = {}  # Structure: {url: (is RSS/ATOM/RDF, page-content)}
> # Question: start_feed_crawl is where I set up my deferreds. getPage
> # returns a deferred and I attach my callback process_link.
> # addCallbacks adds a callback/errback in parallel, so only one or the
> # other is called? So I need to add the final errback to catch errors
> # from the callback process_link?
Correct. Well, sort of. See below.
> def start_feed_crawl(uri, depth):
>     '''Harvest feeds from a uri'''
>     # Question: how do I time out this deferred chain if getPage is
>     # taking too long to finish its work? What exactly does the
>     # timeout argument to getPage do: does it time out the socket
>     # after no response, or does it put an upper bound on how long
>     # getPage has to finish its work?
>     getPage(uri, timeout=TIMEOUT).addCallbacks(
>         callback=process_link, callbackArgs=(uri, depth, StateVars),
>         errback=process_error, errbackArgs=(uri, StateVars)
>         ).addErrback(process_error, uri, StateVars)
It seems clearer to me to write this as follows, but that's personal preference:

    d = getPage(...)
    d.addCallbacks(...)
    d.addErrback(...)

But since you're setting up the call to the same errback twice, you could simplify this to:

    d = getPage(...)
    d.addCallback(process_link, uri, depth, StateVars)
    d.addErrback(process_error, uri, StateVars)

<http://twistedmatrix.com/projects/core/documentation/howto/defer.html#auto4> has a nice visual explanation of what happens when.
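To make the "one or the other" behaviour concrete, here is a small self-contained sketch (synchronous, nothing to do with getPage): the errback in an addCallbacks pair never sees an error raised by its paired callback, while a later addErrback does.

    from twisted.internet import defer

    def ok(result):
        raise RuntimeError('boom')          # error raised *inside* the callback

    def pair_err(failure):
        # Paired with ok via addCallbacks: it handles failures from
        # *earlier* in the chain, so it never sees ok's RuntimeError.
        print 'pair errback (never reached here)'
        return failure

    def chained_err(failure):
        failure.trap(RuntimeError)          # catches what ok raised
        print 'chained errback caught:', failure.getErrorMessage()

    d = defer.succeed('page data')          # stands in for getPage's result
    d.addCallbacks(ok, pair_err)            # one *or* the other runs, never both
    d.addErrback(chained_err)               # a later link catches ok's error

Running this prints "chained errback caught: boom"; pair_err is never called.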
> # Question: since I'm starting up these deferreds in a callback, they
> # are being created after I've called reactor.run(), because we call
> # start_feed_crawl as we find new links that meet our criteria. Am I
> # doing anything bad here? All the examples I've seen (e.g. pp.
> # 548-552, Python Cookbook -- great example by V. Volonghi and P.
> # Cogolo) have their data up front and therefore set up all the
> # deferreds before calling reactor.run(). Here I'm discovering my data
> # as I go along and setting up deferreds while the reactor is
> # spinning. Here is my fundamental lack of understanding: while this
> # script seems to run okay, is it okay to do this?
Yes, that's fine. If anything, the examples you've seen are the odd case: it's unusual to be able to set up all the Deferreds before the reactor starts.
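A trivial demonstration that nothing special happens -- a toy sketch in which a Deferred is created long after the reactor has started:

    from twisted.internet import defer, reactor

    def discovered_late():
        # A brand-new Deferred, created well after reactor.run().
        d = defer.Deferred()
        d.addCallback(lambda result: reactor.stop())
        reactor.callLater(1, d.callback, 'done')

    reactor.callLater(1, discovered_late)
    reactor.run()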
> # Question: Is this how I kill the reactor, i.e. using some sort of
> # state condition? Is there a better way -- should I try harder to
> # understand DeferredList? For example: a top-level DeferredList that
> # contains other DeferredLists, which get created to hold all the
> # deferreds (created by start_feed_crawl) for the links on a given
> # page. Could this DeferredList be told to stop the reactor when the
> # other lists have fired their callbacks (after the component
> # deferreds have finished)? (Sorry for the convoluted question here;
> # I'm new at this.)
What you want to do is stop the reactor when everything is done processing. So after you call start_feed_crawl the first time, returning the Deferred that getPage gives you, you can add a callback to that which stops the reactor.

The trick here is that if you stuff that Deferred into a DeferredList before you add the callback that stops the reactor, and your first operation itself returns a Deferred, then the DeferredList won't call its callbacks until that other Deferred completes too. So you'll be stacking up a whole bunch of Deferreds inside the first one, and the callback on the DeferredList that does the reactor.stop won't fire until a callback finally doesn't return a Deferred.

There might be an easier way to do this, but this is the way I know (example attached). Someone please let me know if there's an easier way.

To see the example, run it with 'twistd -noy fetchpage.tac', then 'telnet localhost 9000' and send:

    GET /?target=http://www.google.com/ HTTP/1.1
    Host: localhost
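Distilled down, the shape of the attached example is roughly this (a bare standalone sketch rather than the .tac file):

    from twisted.internet import defer, reactor
    from twisted.web.client import getPage
    from twisted.python import log

    def fetched(html):
        # If this callback returned another Deferred, the DeferredList
        # below would also wait for that one before firing.
        print 'got %d bytes' % len(html)

    d = getPage('http://www.google.com/')
    d.addCallback(fetched)
    d.addErrback(log.err)

    # The DeferredList adds its own callback *after* the ones above, so
    # it fires once d's whole chain has finished.
    dl = defer.DeferredList([d])
    dl.addCallback(lambda result: reactor.stop())
    reactor.run()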
> Final question: occasionally I get errors that come from the http.py
> code in Twisted. These get printed to the console, but don't
> necessarily stop my program. Should my errbacks be catching these? How
> do I keep errors from getting logged to the console (besides
> redirecting stderr)?
When you create the DeferredList, pass in consumeErrors=1 -- this will make debugging that much more annoying, though...

HTH,
Dave

    from twisted.web import server
    from twisted.web.resource import Resource
    from twisted.web.client import getPage
    from twisted.internet import defer, reactor
    from twisted.python import log
    from cgi import escape

    class Foo(Resource):
        counter = 0
        isLeaf = True

        def render_GET(self, request):
            self.rq = request
            target = escape(request.args['target'][0])
            d = getPage(target).addCallback(self.print_page)
            d.addErrback(log.err)
            dl = defer.DeferredList([d])
            dl.addCallback(stopNow)
            dl.addErrback(log.err)
            return server.NOT_DONE_YET

        def print_page(self, html):
            if Foo.counter < 5:
                Foo.counter += 1
                print 'request ' + str(Foo.counter)
                d = defer.Deferred()
                d.addCallback(self.print_page)
                d.addErrback(log.err)
                reactor.callLater(1, d.callback, html)
                return d
            else:
                print 'now we can write stuff back'
                self.rq.write(str(len(html)) + ' ' + str(Foo.counter))
                self.rq.finish()
                self.rq.transport.loseConnection()
                # no Deferred being returned, so stopNow fires

    def stopNow(cbval):
        # can't add reactor.stop as a callback directly because it
        # doesn't know what to do with the extra argument passed in from
        # the preceding callback
        print cbval
        reactor.stop()

    resource = Foo()
    site = server.Site(resource)

    from twisted.application import service, internet
    application = service.Application("Foo")
    internet.TCPServer(9000, site).setServiceParent(application)

    # vim: ai sts=4 sw=4 expandtab syntax=python :
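For the original script, which has no DeferredList, the equivalent is a final errback that handles the Failure instead of re-raising it -- a minimal sketch of both options:

    from twisted.internet import defer

    # Option 1: let a DeferredList swallow the failures of its children:
    #   dl = defer.DeferredList(deferreds, consumeErrors=1)

    # Option 2: give each Deferred a final errback. An errback that
    # returns a non-Failure consumes the error, so it is never logged to
    # the console as unhandled when the Deferred is garbage-collected.
    def squash(failure):
        return None

    # d stands for any of the getPage Deferreds in the original script:
    d = defer.fail(RuntimeError('simulated http.py error'))
    d.addErrback(squash)                    # nothing is printed or logged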

> which stops the reactor. The trick here is that if you stuff that
> Deferred into a DeferredList before you add the callback that stops
> the reactor, and your first operation itself returns a Deferred, then
> the DeferredList won't call its callbacks until that other Deferred
> completes too. So you'll be stacking up a whole bunch of Deferreds
> inside the first one, and the callback on the DeferredList that does
> the reactor.stop won't fire until a callback finally doesn't return a
> Deferred.
>
> There might be an easier way to do this, but this is the way I know
> (example attached). Someone please let me know if there's an easier
> way.

[snip]

    def render_GET(self, request):
        self.rq = request
        target = escape(request.args['target'][0])
        d = getPage(target).addCallback(self.print_page)
        d.addCallback(stopNow)
        d.addErrback(log.err)
        return server.NOT_DONE_YET
I'm gonna go ahead and answer my own question here: this isn't restricted to DeferredLists; it applies to regular Deferreds too.
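That is, a callback on a plain Deferred that returns another Deferred pauses the rest of the chain until the inner one fires -- a toy sketch:

    from twisted.internet import defer, reactor

    def step_one(result):
        # Returning a Deferred here pauses the outer chain until it fires.
        inner = defer.Deferred()
        reactor.callLater(1, inner.callback, result + ' -> step one')
        return inner

    def step_two(result):
        print result                        # runs only after inner has fired
        reactor.stop()

    d = defer.Deferred()
    d.addCallback(step_one)
    d.addCallback(step_two)
    reactor.callLater(0, d.callback, 'start')
    reactor.run()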