bedouglas at earthlink.net
Thu Mar 5 18:31:23 CET 2009
the url i'm focusing on is irrelevant to the issue i'm trying to solve at
i think an approach will be to fire up a number of parsing attempts, and to
track the returned depts/classes/etc... in theory (hopefully) i should be
able to create a process to build a kind of statistical representation of
what the site looks like (names of depts, names/number of classes for given
depts, etc..) if i'm correct, this would provide a complete
"list/understanding" of what the courselist looks like....
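
a rough sketch of that idea, assuming each parsing run gives back a list of (dept, class) tuples. the name build_profile and the sample data are made up for illustration, they aren't from any real site or library:

```python
from collections import Counter

def build_profile(runs):
    """Return, for each (dept, class) pair, the share of runs it appeared in.

    `runs` is a list of parsing results, one list of pairs per run
    (hypothetical format, assumed for this sketch).
    """
    counts = Counter()
    for run in runs:
        # set() so a pair duplicated inside one run is counted only once
        counts.update(set(run))
    total = len(runs)
    return {pair: n / total for pair, n in counts.items()}

# made-up results from three parsing attempts against the same site
runs = [
    [("MATH", "101"), ("MATH", "201"), ("PHYS", "110")],
    [("MATH", "101"), ("PHYS", "110")],            # this run missed MATH 201
    [("MATH", "101"), ("MATH", "201"), ("PHYS", "110")],
]
profile = build_profile(runs)
```

pairs with a share near 1.0 are probably really on the site. pairs with a low share were either missed by some runs or are noise, and need a closer look.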
i could then run the parsing process a number of times, examining the actual
value/results for the query, and taking the highest/oldest values for the
given query.. the idea being that the app will return correct results for
most of the queries, most of the time.. so from a statistical basis.. i can
take the results that are returned with the highest frequency...
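
something like the following could do the "take the most frequent result" step. it's only a sketch: consensus, min_share and the numbers are hypothetical, and it assumes the value returned for a query (say, the number of classes in a dept) is hashable:

```python
from collections import Counter

def consensus(values, min_share=0.5):
    """Return the most frequent value, or None if its support is too weak.

    `min_share` is an assumed threshold: the winner has to show up in
    more than this share of the runs to be trusted.
    """
    if not values:
        return None
    value, n = Counter(values).most_common(1)[0]
    return value if n / len(values) > min_share else None

# made-up class counts for one dept, from five separate runs
class_counts = [42, 42, 39, 42, 42]
best = consensus(class_counts)
```

returning None when no value has a clear majority keeps a bad run from being accepted just because it happened to come out on top.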
so this approach might work. but again, haven't seen anything in the
literature/'net that talks about this...
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Of John Nagle
Sent: Thursday, March 05, 2009 8:38 AM
To: python-list at python.org
Subject: Re: Parsing/Crawler Questions..
> hi john..
> You're missing the issue, so a little clarification...
> I've got a number of test parsers that point to a given classlist site..
> scripts work.
> the issue that one faces is that you never "know" if you've gotten all of
> the items/links that you're looking for based on the XPath functions. This
> could be due to an error in the parsing, or it could be due to an admin
> changing the site (removing/adding courses etc...)
What URLs are you looking at?