Parsing/Crawler Questions..

Thu Mar 5 09:59:52 EST 2009

hi john..

You're missing the issue, so a little clarification...

I've got a number of test parsers, each pointing at a given class-list
site. The scripts work.

The issue is that you never "know" whether you've gotten all of the
items/links you're looking for from the XPath functions. A shortfall
could be due to an error in the parsing, or it could be due to an admin
changing the site (removing/adding courses, etc.).

So I'm trying to figure out an approach to handling these issues...

As far as I can tell, one approach might be to run the parser script
against the target site X number of times within a narrow timeframe (a few
minutes). Based on the results, you might be able to build up an overall
"tree" of what the actual class/course links/list should be. But you don't
know from hour to hour, or day to day, whether this list is stable, as it
could change at any time.
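
A minimal sketch of that multiple-run idea, assuming each parser run hands
back a plain list of link strings. The merge_runs name and the sample links
are made up for illustration; this isn't taken from the actual parsers.

```python
from collections import Counter


def merge_runs(runs):
    """Combine the link lists from several parser runs.

    Returns (union, stable, flaky):
      union  - every link seen in at least one run
      stable - links seen in every run
      flaky  - {link: fraction of runs that saw it} for the rest
    """
    counts = Counter()
    for links in runs:
        # Count each link once per run, even if a run lists it twice.
        counts.update(set(links))

    total = len(runs)
    union = set(counts)
    stable = {link for link, seen in counts.items() if seen == total}
    flaky = {link: seen / total
             for link, seen in counts.items() if seen < total}
    return union, stable, flaky


# Hypothetical output of three runs a few minutes apart.
runs = [
    ["/soc/csci", "/soc/math", "/soc/phys"],
    ["/soc/csci", "/soc/math"],
    ["/soc/csci", "/soc/math", "/soc/phys", "/soc/chem"],
]
union, stable, flaky = merge_runs(runs)
```

The union is the "best guess" list, and anything in flaky is a link that
some runs missed, which hints at either a parsing problem or a change to the
site during the window. It can't tell you which of the two it was.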

The only way to know for certain is to examine a site by hand. You can't
do that if you're going to develop an automated system for 5-10 sites, let
alone for 500-1000...
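
One way to cut down on the manual checking, sketched here under the
assumption that the link list from the previous crawl of each site is kept
around: compare it with the new list and only flag a site for a human look
when the change is large. The function names and the 20% threshold are
arbitrary choices for illustration.

```python
def diff_snapshots(previous, current):
    """Return the links added and removed between two crawls."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


def needs_review(previous, current, threshold=0.2):
    """True when the share of changed links exceeds the threshold.

    The share is measured against the size of the previous crawl;
    an empty previous crawl is treated as size 1 to avoid dividing
    by zero.
    """
    changes = diff_snapshots(previous, current)
    changed = len(changes["added"]) + len(changes["removed"])
    baseline = max(len(set(previous)), 1)
    return changed / baseline > threshold


# Hypothetical crawls of one site on consecutive days.
yesterday = ["/soc/csci", "/soc/math", "/soc/phys"]
today = ["/soc/csci", "/soc/math", "/soc/chem"]
changes = diff_snapshots(yesterday, today)
flagged = needs_review(yesterday, today)
```

A small diff is probably an admin adding or dropping a course; a big one is
more likely a broken parser or a redesigned page, and that's the case worth
a manual look.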

These are the issues that I'm grappling with.. not how to write the XPath
parsing functions...

Thanks..

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Of John Nagle
Sent: Wednesday, March 04, 2009 10:23 PM
To: python-list at python.org
Subject: Re: Parsing/Crawler Questions..

bruce wrote:
> hi phillip...
>
> thanks for taking a sec to reply...
>
> i'm solid on the test app i've created.. but as an example.. i have a
> parser for usc (southern cal) and it extracts the courselist/class
> schedule... my issue was that i realized that multiple runs of the app
> were giving different results... in my case, the class schedule isn't
> static.. (actually, none of the class/course lists need be static.. they
> could easily change).
>
> so i don't have a priori knowledge of what the actual class/course list
> site would look like, unless i physically examined the site each time i
> run the app...
>
> i'm inclined to think i might need to run the parser a number of times
> within a given time frame, and then take a union/join of the output of the
> different runs.. this would, in theory, give me a high probability that
> i'd get 100% of the class list...

     I think I see the problem.  I took a look at the USC class list, and
it's been made "Web 2.0".  When you read the page, you don't get the
class list; you get a JavaScript thing that builds a class list on
demand, using JSON, no less.

     See "http://web-app.usc.edu/soc/term_20091.html".

     I'm not sure how you're handling this.  The JavaScript actually
has to be run before you get anything.

				John Nagle