philip at semanchuk.com
Thu Mar 5 18:36:05 CET 2009
On Mar 5, 2009, at 12:31 PM, bruce wrote:
> the url i'm focusing on is irrelevant to the issue i'm trying to
> solve at this time.
Not if we're to understand the situation you're trying to describe.
From what I can tell, you're saying that the target site displays
different results each time your crawler visits it. It's as if e.g.
the site knows about 100 courses but only displays 80 randomly chosen
ones to each visitor. If that's the case, then it is truly bizarre.
> i think an approach will be to fire up a number of parsing attempts,
> and to track the returned depts/classes/etc... in theory (hopefully)
> i should be able to create a process to build a kind of statistical
> representation of what the site looks like (names of depts,
> names/number of classes for given depts, etc..) if i'm correct, this
> would provide a complete "list/understanding" of what the courselist
> looks like....
> i could then run the parsing process a number of times, examining
> the actual value/results for the query, and taking the highest/oldest
> values for the given query.. the idea being that the app will return
> correct results for most of the queries, most of the time.. so from a
> statistical basis.. i can take the results that are returned with the
> highest frequency...
> so this approach might work. but again, haven't seen anything in the
> literature/'net that talks about this...
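The frequency-counting idea described above can be sketched roughly as
follows. This is only an illustration, not code from this thread: the
function name aggregate_runs, the min_fraction threshold, and the sample
course names are all made up, and each crawl is assumed to have already
been reduced to a set of item names (depts, classes, etc.).

```python
from collections import Counter


def aggregate_runs(runs, min_fraction=0.5):
    """Combine the results of several crawls of the same page.

    runs         -- iterable of collections, one per crawl, holding the
                    item names (e.g. course IDs) that crawl returned
    min_fraction -- hypothetical threshold: an item counts as "stable"
                    if it appeared in at least this fraction of crawls

    Returns (union, stable, counts):
      union  -- every item seen in at least one crawl
      stable -- items seen in at least min_fraction of the crawls
      counts -- Counter mapping item -> number of crawls it appeared in
    """
    runs = [set(r) for r in runs]  # ignore duplicates within one crawl
    if not runs:
        raise ValueError("need at least one crawl result")

    counts = Counter()
    for seen in runs:
        counts.update(seen)

    total = len(runs)
    union = set(counts)
    stable = {item for item, n in counts.items() if n / total >= min_fraction}
    return union, stable, counts


if __name__ == "__main__":
    # Made-up results from three crawls of the same course list.
    crawls = [
        {"MATH101", "CS101"},
        {"MATH101", "CS102"},
        {"MATH101", "CS101", "CS102"},
    ]
    union, stable, counts = aggregate_runs(crawls, min_fraction=0.7)
    print("seen at least once:", sorted(union))
    print("seen in >= 70% of crawls:", sorted(stable))
    for item, n in counts.most_common():
        print(item, n)
```

Which set is the useful one depends on what is going wrong. If the
site or the parser only ever drops items, the union is closest to the
full list. If some crawls return bogus items, the frequency threshold
is what filters them out.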
> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink.net at python.org
> [mailto:python-list-bounces+bedouglas=earthlink.net at python.org]
> On Behalf Of John Nagle
> Sent: Thursday, March 05, 2009 8:38 AM
> To: python-list at python.org
> Subject: Re: Parsing/Crawler Questions..
> bruce wrote:
>> hi john..
>> You're missing the issue, so a little clarification...
>> I've got a number of test parsers that point to a given classlist
>> scripts work.
>> the issue that one faces is that you never "know" if you've gotten
>> all of the items/links that you're looking for based on the XPath
>> functions. This could be due to an error in the parsing, or it could
>> be due to an changing the site (removing/adding courses etc...)
> What URLs are you looking at?
> John Nagle