[Tutor] can I walk or glob a website?

Marc Tompkins marc.tompkins at gmail.com
Wed May 18 19:48:32 CEST 2011


On Wed, May 18, 2011 at 2:06 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:

> Hello,
>
> How can I walk (as in os.walk) or glob a website? I want to download all
> the pdfs from a website (using urllib.urlretrieve), extract certain figures
> (using pypdf - is this flexible enough?) and make some statistics/graphs from
> those figures (using rpy and R). I forgot what the process of 'automatically
> downloading' is called again, something that sounds like 'whacking' (??)
>
>
I think the word you're looking for is "scraping".
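For the "find all the PDFs on a page" part, the standard library is enough. A minimal sketch (not a full crawler): parse the HTML with html.parser, collect every <a href> that ends in ".pdf", and resolve relative links against the page's URL with urljoin. This is Python 3; the urllib.urlretrieve you mention is the Python 2 spelling (urllib.request.urlretrieve in Python 3). The example.com URL below is just a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect href targets ending in .pdf, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                # Resolve relative hrefs like "reports/a.pdf" to absolute URLs.
                self.pdf_links.append(urljoin(self.base_url, value))


def find_pdf_links(html_text, base_url):
    parser = PdfLinkParser(base_url)
    parser.feed(html_text)
    return parser.pdf_links


html = ('<a href="a.pdf">A</a> '
        '<a href="/docs/b.PDF">B</a> '
        '<a href="notes.txt">C</a>')
print(find_pdf_links(html, "http://example.com/reports/"))
```

From there you'd fetch each page with urllib.request.urlopen(url).read().decode() and pass the result in, then download each link with urllib.request.urlretrieve. (BeautifulSoup is more forgiving of broken HTML if you're willing to install a third-party package.)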

I actually did something (roughly) similar a few years ago, to download a
collection of free Russian audiobooks for my father-in-law (an avid reader
who was quickly going blind).

I crawled the site looking for .mp3 files, then returned a tree from which I
could select files to be downloaded.  It's horribly crude in retrospect,
and I'm embarrassed re-reading my code - but if you're interested I can
forward it (if only as an example of what _not_ to do).
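This isn't my original code, but the idea can be sketched in a few lines: a breadth-first crawl that stays on one host, collects links matching an extension (.mp3 here, .pdf for your case), and returns a dict mapping each page to the files found on it. The fetch callable is injected so you can plug in lambda url: urllib.request.urlopen(url).read().decode() for real use; the example.com pages below are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect raw href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, extension=".mp3"):
    """Breadth-first crawl of one site; returns {page_url: [matching file URLs]}."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = [start_url]
    tree = {}
    while queue:
        url = queue.pop(0)
        parser = LinkParser()
        parser.feed(fetch(url))
        found = []
        for link in parser.links:
            target = urljoin(url, link)
            if target.lower().endswith(extension):
                found.append(target)
            elif urlparse(target).netloc == host and target not in seen:
                # Same host and not yet visited: queue the page for crawling.
                seen.add(target)
                queue.append(target)
        tree[url] = found
    return tree


# Fake two-page site so the sketch runs without network access.
pages = {
    "http://example.com/": '<a href="books/">books</a><a href="one.mp3">1</a>',
    "http://example.com/books/": '<a href="two.mp3">2</a>',
}
tree = crawl("http://example.com/", lambda url: pages.get(url, ""))
print(tree)
```

A real crawler would also want politeness (delays, robots.txt) and error handling around the fetch, but the traversal itself is just this visited-set plus queue.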

