Parsing/Crawler Questions - solution

bruce bedouglas at earthlink.net
Thu Mar 5 17:50:37 EST 2009


hi john...


update...

further investigation has revealed that, for some urls/sites, the server
takes a while to serve up pages... this appears to be a real problem, in
that the parse script apparently never gets anything back from the python
mechanize/urllib read call.
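
if the read is just hanging on a slow server, one thing worth trying is to
bound each fetch with an explicit timeout and retry a few times. a minimal
sketch, using plain urllib (the timeout/retry values here are illustrative
guesses, not numbers from the thread):

import socket
import time
import urllib.error
import urllib.request

def fetch_with_timeout(url, timeout=30, retries=3, delay=5):
    # give up on any single attempt after `timeout` seconds,
    # and retry a few times before failing for good
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            print("attempt %d/%d failed for %s: %s" % (attempt, retries, url, exc))
            time.sleep(delay)
    raise RuntimeError("gave up on %s after %d attempts" % (url, retries))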

the curious issue is that i can run a single test script, pointing at the
url, and after a bit of time the resulting content is fetched/downloaded
correctly. by the way, i get the same results in my test browsing
environment if i start it with only a subset of the urls that i've been
using to test the app.

hmm... might be a resource issue, a timing issue, or something else...
hmmm...

thanks




again.... the problem i'm facing really has nothing to do with a specific
url... the app i have for the usc site works...

but for any number of reasons... you might get different results when
running the app..
-the server could be screwed up..
-data might be cached
-data might be changed, and not updated..
-actual app problems...
-networking issues...
-memory corruption issues...
-process constraint issues..
-web server overload..
-etc...

the assumption most people appear to make is that if you create a parser,
run it, test it once, and it gets you the data, then it's working.. but when
you run the same app 100s of times and you're slamming the webserver, you
realize that's a vastly different animal than simply running a single query
a few times...
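
one cheap mitigation for the "slamming the webserver" part is to throttle
the crawler. a hypothetical sketch (the 2-second delay is a made-up number;
tune it for the target site):

import time
import urllib.request

def polite_fetch(urls, delay=2.0):
    # fixed pause between requests, so hundreds of runs
    # don't look like a denial-of-service to the server
    results = {}
    for url in urls:
        with urllib.request.urlopen(url, timeout=30) as resp:
            results[url] = resp.read()
        time.sleep(delay)
    return results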

so.. nope, i'm not running the app and getting data from a dynamic page that
hasn't finished uploading/creating the content..

but what my analysis is showing, not only for the usc site but for others as
well, is that there can be differences in what gets returned...

which is where a smoothing algorithmic approach appears to be workable..

i've been starting to test this approach, and it actually might have a
chance of working...
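
the post never spells out the "smoothing" approach, but one plausible
reading is: fetch the same url several times, hash each body, and keep the
version the majority agrees on. a sketch along those lines (a guess at the
idea, not the poster's actual code):

import hashlib
import urllib.request
from collections import Counter

def fetch_by_consensus(url, samples=5):
    # fetch several times and return the body that most fetches
    # agree on; a crude filter against flaky/inconsistent responses
    bodies = {}
    votes = Counter()
    for _ in range(samples):
        with urllib.request.urlopen(url, timeout=30) as resp:
            body = resp.read()
        digest = hashlib.sha1(body).hexdigest()
        bodies[digest] = body
        votes[digest] += 1
    best, count = votes.most_common(1)[0]
    if count <= samples // 2:
        raise RuntimeError("no majority result for %s" % url)
    return bodies[best]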

so.. as i've stated a number of times.. focusing on a specific url isn't the
issue.. the larger issue is how you can
programmatically/algorithmically/automatically be reasonably assured that
what you have is exactly what's on the site...

ain't screen scraping fun!!!



-----Original Message-----
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org] On Behalf Of John Nagle
Sent: Thursday, March 05, 2009 10:54 AM
To: python-list at python.org
Subject: Re: Parsing/Crawler Questions - solution


Philip Semanchuk wrote:
> On Mar 5, 2009, at 12:31 PM, bruce wrote:
>
>> hi..
>>
>> the url i'm focusing on is irrelevant to the issue i'm trying to solve at
>> this time.
>
> Not if we're to understand the situation you're trying to describe. From
> what I can tell, you're saying that the target site displays different
> results each time your crawler visits it. It's as if e.g. the site knows
> about 100 courses but only displays 80 randomly chosen ones to each
> visitor. If that's the case, then it is truly bizarre.

     Agreed.  The course list isn't changing that rapidly.

     I suspect the original poster is doing something like reading the DOM
of a dynamic page while the page is still updating, running a browser
in a subprocess.  Is that right?

     I've had to deal with that in Javascript.  My AdRater browser plug-in
(http://www.sitetruth.com/downloads) looks at Google-served ads and
rates the advertisers.   There, I have to watch for page-change events
and update the annotations I'm adding to ads.

     But you don't need to work that hard here. The USC site is actually
querying a server which provides the requested data in JSON format.  See

	http://web-app.usc.edu/soc/dev/scripts/soc.js

Reverse-engineer that and you'll be able to get the underlying data.
(It's an amusing script; many little fixes to data items are performed,
something that should have been done at the database front end.)

The way to get USC class data is this:

1.  Start here: "http://web-app.usc.edu/soc/term_20091.html"
2.  Examine all the department pages under that page.
3.  On each page, look for the value of "coursesrc", like this:
	var coursesrc = '/ws/soc/api/classes/aest/20091'
4.  For each "coursesrc" value found, construct a URL like this:
	http://web-app.usc.edu/ws/soc/api/classes/aest/20091
5.  Read that URL.  This will return the department's course list in
     JSON format.
6.  From the JSON tree, pull out CourseData items, which look like this:

CourseData":
{"prefix":"AEST",
"number":"220",
"sequence":"B",
"suffix":{},
"title":"Advanced Leadership Laboratory II",
"description":"Additional exposure to the military experience for continuing
AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and
the
environment of an Air Force officer. Credit\/No Credit.",
"units":"1",
"restriction_by_major":{},
"restriction_by_class":{},
"restriction_by_school":{},
"CourseNotes":{},
"CourseTermNotes":{},
"prereq_text":"AEST-220A",
"coreq_text":{},
"SectionData":{"id":"41799",
"session":"790",
"dclass_code":"D",
"title":"Advanced Leadership Laboratory II",
"section_title":{},
"description":{},
"notes":{},
"type":"Lec",
"units":"1",
"spaces_available":"30",
"number_registered":"2",
"wait_qty":"0",
"canceled":"N",
"blackboard":"Y",
"comment":{},
"day":{},"start_time":"TBA",
"end_time":"TBA",
"location":"OFFICE",
"instructor":{"last_name":"Hampton","first_name":"Daniel"},
"syllabus":{"format":{},"filesize":{}},
"IsDistanceLearning":"N"}}},

Parsing the JSON is left as an exercise for the student.  (There's
a Python module for that.)
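
A minimal sketch of steps 1-6 above (untested against the live site; the
regex and URL shapes come from this post, and the real JSON nesting may
differ in detail, so the code searches the tree generically):

import json
import re
import urllib.request

BASE = "http://web-app.usc.edu"

def coursesrc_values(page_url):
    # Step 3: scan a department page for the coursesrc variable,
    # e.g.  var coursesrc = '/ws/soc/api/classes/aest/20091'
    html = urllib.request.urlopen(page_url, timeout=30).read().decode("utf-8", "replace")
    return re.findall(r"var\s+coursesrc\s*=\s*'([^']+)'", html)

def course_data(coursesrc):
    # Steps 4-6: fetch the department's JSON course list and pull
    # out every CourseData item, wherever it sits in the tree
    raw = urllib.request.urlopen(BASE + coursesrc, timeout=30).read()
    tree = json.loads(raw)
    found = []
    def walk(node):
        if isinstance(node, dict):
            if "CourseData" in node:
                found.append(node["CourseData"])
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
    walk(tree)
    return found

# Example: print the course titles for one department
# for c in course_data("/ws/soc/api/classes/aest/20091"):
#     print(c["prefix"], c["number"], c["title"])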

And no, the data isn't changing; you can read those pages of JSON over and
over and get the same data every time.

					John Nagle
--
http://mail.python.org/mailman/listinfo/python-list



