Parsing/Crawler Questions..

Wed Mar 4 23:07:54 EST 2009

hi phillip...

thanks for taking a sec to reply...

i'm solid on the test app i've created.. but as an example.. i have a parser
for usc (southern cal) and it extracts the courselist/class schedule... my
issue was that i realized that multiple runs of the app were giving different
results... in my case, the class schedule isn't static.. (actually, none of
the class/course lists need be static.. they could easily change).

so i don't have a priori knowledge of what the actual class/course list site
would look like, unless i physically examined the site each time i run the
app...

i'm inclined to think i might need to run the parser a number of times
within a given time frame, and then take a union/join of the output of the
different runs.. this would, in theory, give me a high probability that i'd
get 100% of the class list...
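
a rough sketch of the union idea, just to make it concrete. the course ids
and the merge_runs name are made up for illustration, and each run is assumed
to be a plain set of ids. it also reports which items were missing from at
least one run, since a union alone can't tell a class the crawler missed from
one that was really dropped from the schedule between runs.

```python
from collections import Counter


def merge_runs(runs):
    """Union several crawl runs and flag items not seen in every run."""
    seen = Counter()
    for run in runs:
        seen.update(set(run))  # set() so a duplicate within one run counts once
    all_items = set(seen)
    # Items seen in fewer runs than were made are the ones worth a second look.
    flaky = {item for item, count in seen.items() if count < len(runs)}
    return all_items, flaky


# Hypothetical output of three runs against the same schedule.
runs = [
    {"CSCI-101", "CSCI-102", "MATH-125"},
    {"CSCI-101", "MATH-125", "MATH-126"},
    {"CSCI-101", "CSCI-102", "MATH-125", "MATH-126"},
]
all_items, flaky = merge_runs(runs)
print(len(all_items), sorted(flaky))
```

the flaky set is probably the more useful output: if it is large, the runs
disagree a lot and the union is less trustworthy.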

most crawlers, and most research that i've seen focus on the indexing, or
crawling function/architecture.. haven't really seen any
articles/research/pointers dealing with this kind of process...

thoughts/comments are welcome..

thanks

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Of Philip Semanchuk
Sent: Wednesday, March 04, 2009 6:15 PM
To: python-list (General)
Subject: Re: Parsing/Crawler Questions..

On Mar 4, 2009, at 4:44 PM, bruce wrote:

> Hi...
>
> Sorry that this is a bit off track. Ok, maybe way off track!
>
> But I don't have anyone to bounce this off of..
>
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in python to crawl the
> site. I crawl at the top level, and work my way down to getting the required
> course/class schedule. The app works. I can consistently run it and extract
> the information. The required information is based upon an XPath analysis of
> the DOM for the given pages that I'm parsing.
>
> My issue is now that I have a "basic" app that works, I need to figure out
> how I guarantee that I'm correctly crawling the site. How do I know when
> I've got an error at a given node/branch, so that the app knows that it's
> not going to fetch the underlying branch/nodes of the tree..
>
> When running the app, I can get 5000 classes on one run, 4700 on another,
> etc... So I need some method of determining when I get a "complete"
> tree...
>
> How do I know when I have a complete "tree"!

Hi Bruce,
To put this another way, you're trying to convince yourself that your
program is correct, yes? For instance, you're worried that you might
be doing something like discovering a URL on a site but failing to
pursue that URL, yes?
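
One way to make that failure visible is to have the crawler keep separate
records of what it discovered, what it fetched, and what failed. The sketch
below is only an illustration: `crawl`, `fetch_links`, and the in-memory
`FAKE_SITE` are hypothetical names, and a real fetcher would do the HTTP
request and link extraction where `fake_fetch` does a dictionary lookup.

```python
from collections import deque


def crawl(start, fetch_links):
    """Breadth-first crawl that records discovered, fetched and failed URLs."""
    discovered = {start}
    fetched = set()
    failed = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        try:
            links = fetch_links(url)
        except Exception as exc:  # record the failure instead of hiding it
            failed[url] = repr(exc)
            continue
        fetched.add(url)
        for link in links:
            if link not in discovered:
                discovered.add(link)
                queue.append(link)
    return discovered, fetched, failed


# Hypothetical site: "/math" is linked from "/" but cannot be fetched.
FAKE_SITE = {"/": ["/cs", "/math"], "/cs": ["/cs/101"], "/cs/101": []}


def fake_fetch(url):
    return FAKE_SITE[url]  # raises KeyError for "/math", simulating an error


discovered, fetched, failed = crawl("/", fake_fetch)
print(sorted(discovered - fetched), failed)
```

Any URL in `discovered` but not in `fetched` marks a branch whose children
were never visited, so a run with a non-empty difference is known to be
incomplete.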

The standard way of testing any program is to test known input and
look for expected output. Repeat as necessary. In your case that would
mean crawling a site where you know all of the URLs and seeing if your
program finds them all. And that, of course, isn't proof of
correctness, it just means that that particular site didn't trigger
any error conditions that would cause your program to misbehave.
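
As a minimal sketch of that comparison, assuming you have the known URL list
and the crawler's output as plain collections of strings (the function name
and the example URLs are made up):

```python
def compare_crawl(expected_urls, found_urls):
    """Report URLs the crawler missed and URLs it found unexpectedly."""
    expected, found = set(expected_urls), set(found_urls)
    return {
        "missed": sorted(expected - found),
        "unexpected": sorted(found - expected),
    }


# Hypothetical known site map versus what one crawl run returned.
expected = ["/", "/cs", "/cs/101", "/math", "/math/125"]
found = ["/", "/cs", "/cs/101", "/math"]
report = compare_crawl(expected, found)
print(report)
```

An empty "missed" list only says the crawler handled this particular site,
which is the caveat above.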

I think every modern OS makes it easy to run a Web server on your
local machine. You might want to set up a suite of test sites on your
machine and point your program at localhost. That way you can build a
site to test your application in areas you fear it may be weak.
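
For what it's worth, Python's standard library can serve such a test site
without installing anything. The following is a rough sketch, assuming
Python 3.7 or later; the two page files and the `serve_directory` helper are
invented for the example, and a real test suite would point the crawler at
`base` instead of fetching a single page.

```python
import functools
import http.server
import pathlib
import tempfile
import threading
import urllib.request


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # keep test output free of per-request log lines


def serve_directory(directory):
    """Serve a directory on 127.0.0.1 using a free port, in a background thread."""
    handler = functools.partial(QuietHandler, directory=str(directory))
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


with tempfile.TemporaryDirectory() as root:
    site = pathlib.Path(root)
    # Hypothetical two-page test site with one known link.
    (site / "index.html").write_text('<a href="courses.html">courses</a>')
    (site / "courses.html").write_text("<p>CSCI-101</p>")

    server = serve_directory(site)
    base = "http://127.0.0.1:%d/" % server.server_address[1]
    # An empty ProxyHandler makes sure a configured proxy is not used for localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(base + "courses.html") as response:
        body = response.read().decode()
    server.shutdown()
    server.server_close()

print(body)
```

Because you wrote the pages yourself, you know exactly which URLs and
courses the crawler should report.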

I'm unclear on what you're using to parse the pages, but (X)HTML is
very often invalid in the strict sense of validity. If the tools
you're using expect/insist on well-formed XML or valid HTML, they'll
be disappointed on most sites and you'll probably be missing URLs. The
canonical solution for parsing real-world Web pages with Python is
BeautifulSoup.
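
To illustrate what tolerant parsing buys you, here is a small sketch. Note
that it deliberately uses the standard library's `html.parser` as a stand-in
rather than BeautifulSoup, so that it runs without a third-party install;
the `LinkCollector` class, the base URL, and the broken markup are all made
up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets, even from badly formed markup."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


# Unclosed <p> and <a> tags, an unquoted attribute, no closing </html>.
broken_html = (
    '<html><body><p>Fall schedule'
    '<a href="/cs/101">CS 101<br>'
    '<a href=math/125>Math 125</body>'
)
collector = LinkCollector("http://example.edu/courses/")
collector.feed(broken_html)
collector.close()
print(collector.links)
```

A strict XML parser would reject this input outright, and every link on the
page would be lost along with it.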

HTH
Philip

--
http://mail.python.org/mailman/listinfo/python-list