bedouglas at earthlink.net
Thu Mar 5 15:59:52 CET 2009
You're missing the issue, so a little clarification...
I've got a number of test parsers that point to a given classlist site. The
issue one faces is that you never "know" whether you've gotten all of
the items/links you're looking for based on the XPath functions. This
could be due to an error in the parsing, or it could be due to an admin
changing the site (removing/adding courses, etc.).
So I'm trying to figure out an approach to handling these issues...
As far as I can tell, one approach might be to run the parser script across
the target site X number of times within a narrow timeframe (a few minutes).
Based on the results of this process, you might be able to develop an
overall "tree" of what the actual class/course links/list should be. But you
don't know from hour to hour, or day to day, whether this list is stable, as it
could change at any time. The only way you know for certain is to physically
examine a site. You can't do this if you're going to develop an automated
system for 5-10 sites, or more.
These are the issues that I'm grappling with.. not how to write the XPath
queries themselves.
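The "run it several times and merge" idea can be sketched roughly as below. This is only an illustration, assuming a hypothetical per-run link set (in practice each set would come from one pass of your XPath-based parser); it unions the runs into a candidate course list and flags links that didn't show up in every run:

```python
# Sketch of merging several parser runs into one candidate list.
# Each element of `runs` is the set of course links one run extracted.

def merge_runs(runs):
    """Return (all links seen in any run, links seen in only some runs)."""
    if not runs:
        return set(), set()
    union = set().union(*runs)                      # everything ever seen
    stable = set(runs[0]).intersection(*runs[1:])   # seen in every run
    return union, union - stable

# Three simulated runs: one run missed a link, one run saw a new course.
runs = [
    {"cs101", "cs102", "math201"},
    {"cs101", "math201"},                       # cs102 missed this run
    {"cs101", "cs102", "math201", "bio110"},    # admin added bio110
]
all_links, unstable = merge_runs(runs)
```

The "unstable" set is the interesting part: it tells you which links you'd want to re-check, rather than just hoping the union is complete.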
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Of John Nagle
Sent: Wednesday, March 04, 2009 10:23 PM
To: python-list at python.org
Subject: Re: Parsing/Crawler Questions..
> hi phillip...
> thanks for taking a sec to reply...
> i'm solid on the test app i've created.. but as an example.. i have a parser
> for usc (southern cal) and it extracts the courselist/class schedule... my
> issue was that i realized that multiple runs of the app were giving different
> results... in my case, the class schedule isn't static.. (actually, none of
> the class/course lists need be static.. they could easily change).
> so i don't have apriori knowledge of what the actual class/course list
> would look like, unless i physically examined the site each time i run the app.
> i'm inclined to think i might need to run the parser a number of times
> within a given time frame, and then take a union/join of the output of the
> different runs.. this would in theory, give me a high probability that i'd
> get 100% of the class list...
I think I see the problem. I took a look at the USC class list, and
it's been made "Web 2.0". When you read the page, you don't get the
class data; it's loaded on demand, using JSON, no less. JavaScript
has to be run before you get anything.
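If the class data really does arrive as JSON behind the page, one option is to skip the rendered HTML entirely and fetch the JSON endpoint that the page's JavaScript calls (you can usually spot it in the browser's network inspector). A rough sketch, where the endpoint URL and the "courses"/"code" field names are purely illustrative, not USC's actual API:

```python
import json
import urllib.request

def fetch_course_codes(endpoint):
    """Fetch a (hypothetical) JSON schedule endpoint and pull out
    the course codes, bypassing the JavaScript-rendered page."""
    with urllib.request.urlopen(endpoint) as resp:
        data = json.load(resp)
    return [c["code"] for c in data["courses"]]

# The parsing step works the same on a captured sample of the payload:
sample = '{"courses": [{"code": "CSCI-101"}, {"code": "MATH-226"}]}'
codes = [c["code"] for c in json.loads(sample)["courses"]]
```

The upside is that JSON is far more stable to parse than scraped HTML; the downside is that the endpoint itself is undocumented and can change without notice, so the multiple-runs comparison above still applies.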