[Tutor] can I walk or glob a website?
Albert-Jan Roskam
fomcl at yahoo.com
Wed May 18 13:23:02 CEST 2011
From: Dave Angel <davea at ieee.org>
To: Alan Gauld <alan.gauld at btinternet.com>
Cc: tutor at python.org
Sent: Wed, May 18, 2011 11:51:35 AM
Subject: Re: [Tutor] can I walk or glob a website?
Alan Gauld wrote:
>
> "Albert-Jan Roskam" <fomcl at yahoo.com> wrote
>> How can I walk (as in os.walk) or glob a website?
>
> I don't think there is a way to do that via the web.
> Of course if you have access to the web server's filesystem you can use
> os.walk to do it as for any other filesystem, but I don't think it's
> generally possible over http. (And indeed it shouldn't be, for very good
> security reasons!)
>
> OTOH I've been wrong before! :-)
>
It has to be (more or less) possible. That's what google does for their search
engine.
Three broad issues.
1) Are you violating the terms of service of such a web site? Are you going to
be doing this seldom enough that the bandwidth used won't be a DOS attack? Are
there copyrights to the material you plan to download? Is the website protected
by a login, by cookies, or a VPN? Does the website present a different view to
different browsers, different OS's, or different target domains?
===> This crossed my mind too. The advantage of using Python is that it's fun
and that it saves me from getting a mouse arm. It is a Dutch government site with
pdf quality control reports (from the municipal health service) of
kindergartens. I thought it would be fun and useful to make graphic
representations (spider charts) of each kindergarten, so they can be easily
compared. This is just a hobby project. It just bothers me that they're not very
easy to compare.
2) Websites vary enormously in their adherence to standards. There are many
such standards, and browsers tend to be very tolerant of bugs in the site which
will be painful for you to accommodate. And some of the extensions/features are
very hard to parse, such as flash. Others, such as javascript, can make it hard
to parse the page statically.
===> I checked some of the deep links already. They are of the form
[/\\\.a-z]+docid[0-9]+resultid[0-9]+ (roughly speaking), e.g.
http://www.landelijkregisterkinderopvang.nl/pp/inzien/Oko/InspectieRapport.jsf?documentId=5547&selectedResultId=5548
I could use a brute force approach and try all the doc/result id combinations.
But wouldn't that result in a high server load? If so, I could put the program
to sleep for n seconds.
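===> A minimal sketch of that idea: generate every documentId/resultId
combination from the URL pattern and fetch them with a pause in between so
the server load stays low. The id ranges and the delay are placeholders, not
values taken from the site, and the fetch function is injected so the
enumeration logic can be tried without touching the network.

```python
import time
from itertools import product

# Assumed URL template, based on the example deep link above; the real id
# ranges are unknown and the ones passed in below are purely illustrative.
URL_TEMPLATE = ("http://www.landelijkregisterkinderopvang.nl/pp/inzien/Oko/"
                "InspectieRapport.jsf?documentId={doc}&selectedResultId={res}")

def candidate_urls(doc_ids, result_ids):
    """Yield one URL for every documentId/selectedResultId combination."""
    for doc, res in product(doc_ids, result_ids):
        yield URL_TEMPLATE.format(doc=doc, res=res)

def polite_crawl(urls, fetch, delay=5.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between
    requests so the enumeration does not hammer the server."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)
    return results
```

In real use, `fetch` would be something like `urllib.request.urlopen`; here
it is a parameter so the pacing logic is testable on its own.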
3) How important is it to do it reliably? Your code may work perfectly with a
particular website, and next week they'll make a change which breaks your code
entirely. Are you willing to rework the code each time that happens?
===> It should be reliable. Portability to other sites is, of course, cool, but
not strictly necessary.
Many sites have APIs that you can use to access them. Sometimes this is a
better answer.
With all of that said, I'll point you to Beautiful Soup, as a library that'll
parse a page of moderately correct html and give you the elements of it. If
it's a static page, you can then walk the elements of the tree that Beautiful
Soup gives you, and find all the content that interests you. You can also find
all the web pages that the first one refers to, and recurse on that.
Notice that you need to limit your scope, since many websites have direct and
indirect links to most of the web. For example, you might only recurse into
links that refer to the same domain. For many websites, that means you won't
get it all. So you may want to supply a list of domains and/or subdomains that
you're willing to recurse into.
See http://pypi.python.org/pypi/BeautifulSoup/3.2.0
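===> To make the domain-limiting idea concrete, here is a small sketch using
only the standard library (html.parser instead of Beautiful Soup, so it has
no dependencies; with Beautiful Soup you would find the <a> tags the same
way). It collects the links from a page, resolves relative ones against the
page's URL, and keeps only those whose host is in an allowed set.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def same_domain_links(html, base_url, allowed_domains):
    """Return the links in `html` whose host is in `allowed_domains`,
    so a recursive walk never wanders off into the rest of the web."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [url for url in parser.links
            if urlparse(url).netloc in allowed_domains]
```

A crawler would then recurse only into the URLs this returns, keeping a set
of already-visited pages to avoid loops.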
===> Thanks, I'll check BS.
DaveA
Best wishes,
Albert-Jan
_______________________________________________
Tutor maillist - Tutor at python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor