google at mrabarnett.plus.com
Wed Mar 4 23:19:20 CET 2009
> Sorry that this is a bit off track. Ok, maybe way off track!
> But I don't have anyone to bounce this off.
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in Python to crawl the
> site. I crawl at the top level, and work my way down to getting the required
> course/class schedule. The app works. I can consistently run it and extract
> the information. The required information is based upon an XPath analysis of
> the DOM for the given pages that I'm parsing.
> Now that I have a "basic" app that works, my issue is figuring out how to
> guarantee that I'm crawling the site correctly. How do I know when I've
> hit an error at a given node/branch, so that the app knows it won't be
> fetching the nodes underneath that branch?
If you were crawling the site yourself, how would _you_ know when you
had an error at a given node/branch?
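
To make that concrete, here is a minimal sketch (not the original poster's
app) of a crawl that records every node it could not process, so the
branches that were never visited are known afterwards. The names crawl,
fetch and extract_links are hypothetical: fetch and extract_links stand in
for whatever download and XPath code the real app uses, and are assumed to
raise an exception when a page cannot be retrieved or parsed.

```python
from collections import deque


def crawl(start, fetch, extract_links):
    """Breadth-first crawl that remembers which nodes failed and why.

    fetch(url) returns a page or raises; extract_links(page) returns the
    child URLs or raises. Both are supplied by the caller (assumptions).
    """
    seen = {start}
    queue = deque([start])
    ok = []        # nodes that were fetched and parsed
    failed = {}    # node -> description of the error that stopped it

    while queue:
        url = queue.popleft()
        try:
            page = fetch(url)
            children = extract_links(page)
        except Exception as exc:
            # Nothing below this node gets queued, so record the reason.
            failed[url] = repr(exc)
            continue
        ok.append(url)
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)

    return ok, failed
```

Anything left in failed marks a branch whose children were never fetched.
A page that loads but yields no match for an XPath that is expected to
match could be reported the same way, by raising from extract_links.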
More information about the Python-list