Parsing/Crawler Questions - solution
John Nagle
nagle at animats.com
Thu Mar 5 13:54:07 EST 2009
Philip Semanchuk wrote:
> On Mar 5, 2009, at 12:31 PM, bruce wrote:
>
>> hi..
>>
>> the url i'm focusing on is irrelevant to the issue i'm trying to solve at
>> this time.
>
> Not if we're to understand the situation you're trying to describe. From
> what I can tell, you're saying that the target site displays different
> results each time your crawler visits it. It's as if e.g. the site knows
> about 100 courses but only displays 80 randomly chosen ones to each
> visitor. If that's the case, then it is truly bizarre.
Agreed. The course list isn't changing that rapidly.
I suspect the original poster is running a browser in a subprocess
and reading the DOM of a dynamic page while the page is still
updating. Is that right?
I've had to deal with that in JavaScript. My AdRater browser plug-in
(http://www.sitetruth.com/downloads) looks at Google-served ads and
rates the advertisers. There, I have to watch for page-change events
and update the annotations I'm adding to ads.
But you don't need to work that hard here. The USC site is actually
querying a server which provides the requested data in JSON format. See
http://web-app.usc.edu/soc/dev/scripts/soc.js
Reverse-engineer that and you'll be able to get the underlying data.
(It's an amusing script; many little fixes to data items are performed,
something that should have been done at the database front end.)
The way to get USC class data is this:
1. Start here: "http://web-app.usc.edu/soc/term_20091.html"
2. Examine all the department pages under that page.
3. On each page, look for the value of "coursesrc", like this:
var coursesrc = '/ws/soc/api/classes/aest/20091'
4. For each "coursesrc" value found, construct a URL like this:
http://web-app.usc.edu/ws/soc/api/classes/aest/20091
5. Read that URL. This will return the department's course list in
JSON format.
6. From the JSON tree, pull out CourseData items, which look like this:
"CourseData":
{"prefix":"AEST",
"number":"220",
"sequence":"B",
"suffix":{},
"title":"Advanced Leadership Laboratory II",
"description":"Additional exposure to the military experience for continuing AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and the environment of an Air Force officer. Credit\/No Credit.",
"units":"1",
"restriction_by_major":{},
"restriction_by_class":{},
"restriction_by_school":{},
"CourseNotes":{},
"CourseTermNotes":{},
"prereq_text":"AEST-220A",
"coreq_text":{},
"SectionData":{"id":"41799",
"session":"790",
"dclass_code":"D",
"title":"Advanced Leadership Laboratory II",
"section_title":{},
"description":{},
"notes":{},
"type":"Lec",
"units":"1",
"spaces_available":"30",
"number_registered":"2",
"wait_qty":"0",
"canceled":"N",
"blackboard":"Y",
"comment":{},
"day":{},"start_time":"TBA",
"end_time":"TBA",
"location":"OFFICE",
"instructor":{"last_name":"Hampton","first_name":"Daniel"},
"syllabus":{"format":{},"filesize":{}},
"IsDistanceLearning":"N"}}},
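Steps 3 through 5 might be sketched roughly as follows. This is only an illustration, not the code USC's own script uses: the function names, the regular expression, and the UTF-8 decoding are assumptions, and the pattern presumes "coursesrc" always appears as a quoted string assigned with "var", as in the example line in step 3.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Starting page from step 1; relative paths are resolved against it.
BASE_URL = "http://web-app.usc.edu/soc/term_20091.html"

# Matches lines such as: var coursesrc = '/ws/soc/api/classes/aest/20091'
# (assumed form; the real pages may format this differently)
COURSESRC_RE = re.compile(r"""var\s+coursesrc\s*=\s*['"]([^'"]+)['"]""")


def find_coursesrc(page_text):
    """Return the coursesrc path found in a department page, or None."""
    match = COURSESRC_RE.search(page_text)
    return match.group(1) if match else None


def build_course_url(coursesrc, base=BASE_URL):
    """Turn a site-relative coursesrc path into an absolute URL (step 4)."""
    return urljoin(base, coursesrc)


def fetch_text(url):
    """Read a URL and return its body as text (step 5).

    Decoding as UTF-8 is an assumption; undecodable bytes are replaced.
    """
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the example from step 3, find_coursesrc() returns '/ws/soc/api/classes/aest/20091' and build_course_url() turns that into the URL shown in step 4. Passing that URL to fetch_text() would give the JSON text for the department.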
Parsing the JSON is left as an exercise for the student. (There's
a Python module for that: "json", in the standard library since
Python 2.6.)
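For anyone who wants a starting point, here is a minimal sketch of pulling the CourseData items out of the parsed tree. The excerpt above doesn't show the outer layout of the response, so this walks the whole tree looking for "CourseData" keys rather than assuming a fixed path. The helper names are made up, and treating a list under "CourseData" as several courses is an assumption.

```python
import json


def iter_course_data(node):
    """Yield every value stored under a "CourseData" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "CourseData":
                # Assumption: a list here means several courses,
                # anything else is a single course record.
                if isinstance(value, list):
                    yield from value
                else:
                    yield value
            else:
                yield from iter_course_data(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_course_data(item)


def text_or_none(value):
    """The excerpt uses {} for empty fields; map that to None."""
    return None if value == {} else value


# Tiny stand-in document; the "Dept" wrapper key is hypothetical.
doc = json.loads(
    '{"Dept": {"CourseData": {"prefix": "AEST", "number": "220", "suffix": {}}}}'
)
for course in iter_course_data(doc):
    print(course["prefix"], course["number"], text_or_none(course["suffix"]))
```

In real use, the string passed to json.loads() would be the text read from one of the "coursesrc" URLs.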
And no, the data isn't changing; you can read those pages of JSON over and
over and get the same data every time.
John Nagle