Parsing/Crawler Questions..

MRAB google at mrabarnett.plus.com
Wed Mar 4 17:19:20 EST 2009


bruce wrote:
> Hi...
> 
> Sorry that this is a bit off track. Ok, maybe way off track!
> 
> But I don't have anyone to bounce this off of..
> 
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in python to crawl the
> site. I crawl at the top level, and work my way down to getting the required
> course/class schedule. The app works. I can consistently run it and extract
> the information. The required information is based upon an XPath analysis of
> the DOM for the given pages that I'm parsing.
> 
> My issue is that, now that I have a "basic" app that works, I need to figure
> out how to guarantee that I'm crawling the site correctly. How do I know when
> I've hit an error at a given node/branch, so that the app knows it's
> not going to fetch the underlying branches/nodes of the tree?
> 
[snip]
If you were crawling the site yourself, how would _you_ know when you
had an error at a given node/branch?
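
To make that concrete: a person browsing the site would notice an error
page, a page that will not load, or a page that is missing the links or the
schedule table they expected. A crawler can apply the same checks at every
node and record the node as failed rather than silently skipping the branch
below it. The following is only a minimal sketch of that idea, not the
original app. The names check_page, crawl, fetch and levels are hypothetical,
and it uses the standard library's xml.etree.ElementTree (which handles only
well-formed markup and a subset of XPath) purely to stay self-contained; a
real crawler would more likely use an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET


def check_page(status, body, required_paths):
    """Return a list of problems for one fetched page (empty list = looks OK)."""
    if status != 200:
        return [f"HTTP status {status}"]
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        return [f"unparseable page: {exc}"]
    # Every XPath we rely on must match at least one element.
    return [f"nothing matched {path!r}"
            for path in required_paths if not root.findall(path)]


def crawl(url, fetch, levels, depth=0, failures=None):
    """Walk the site depth-first, recording every node that fails its checks.

    fetch(url) -> (status, body) is supplied by the caller (hypothetical).
    levels[depth] is (required_paths, link_path); link_path is None at a leaf.
    """
    failures = {} if failures is None else failures
    required_paths, link_path = levels[depth]
    status, body = fetch(url)
    problems = check_page(status, body, required_paths)
    if problems:
        # The branch below this node is not fetched, and we know why.
        failures[url] = problems
        return failures
    if link_path is not None:
        for link in ET.fromstring(body).findall(link_path):
            crawl(link.get("href"), fetch, levels, depth + 1, failures)
    return failures


# A tiny in-memory "site" standing in for real HTTP requests (made-up data).
SITE = {
    "/": (200, "<html><body><a class='dept' href='/cs'/>"
               "<a class='dept' href='/math'/></body></html>"),
    "/cs": (200, "<html><body><table><tr><td>CS101</td></tr></table>"
                 "</body></html>"),
    "/math": (200, "<html><body><p>No schedule</p></body></html>"),
}

LEVELS = [
    ([".//a[@class='dept']"], ".//a[@class='dept']"),  # top level: dept links
    ([".//table/tr/td"], None),                        # leaf: schedule cells
]


def fake_fetch(url):
    return SITE.get(url, (404, ""))


if __name__ == "__main__":
    for bad_url, why in crawl("/", fake_fetch, LEVELS).items():
        print(bad_url, why)
```

With the made-up data above, only the "/math" page should be reported, since
it loads but contains no schedule table. Keeping the failures in a dict
means a run can be compared against the previous one, which is one way to
notice when the site's layout has changed underneath the XPath expressions.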



