read all available pages on a Website
timr at probo.com
Mon Sep 13 08:13:49 CEST 2004
Brad Tilley <bradtilley at usa.net> wrote:
>Is there a way to make urllib or urllib2 read all of the pages on a Web
>site? For example, say I wanted to read each page of www.python.org into
>separate strings (a string for each page). The problem is that I don't
>know how many pages are at www.python.org. How can I handle this?
You have to parse the HTML to pull out all the links and images and fetch
them, one by one. sgmllib can help with the parsing. You can multithread
this, if performance is an issue.
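A rough sketch of that approach follows. Note that sgmllib was later removed from the standard library (in Python 3), so this uses html.parser, its closest stdlib replacement; the names LinkParser and crawl, the max_pages limit, and the same-host restriction are illustrative choices, not part of the original suggestion.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect href/src targets from <a> and <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.links.append(attrs["src"])


def crawl(start_url, max_pages=50):
    """Fetch pages breadth-first, staying on the starting host.

    Returns a dict mapping each fetched URL to its page text --
    the "a string for each page" the question asked for.
    """
    host = urlparse(start_url).netloc
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to fetch
        pages[url] = html
        # Parse out the links and queue any that stay on this host.
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host:
                queue.append(absolute)
    return pages
```

To multithread the fetches, the urlopen calls could be handed to a pool (e.g. concurrent.futures.ThreadPoolExecutor) instead of fetched one at a time.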
By the way, there are many web sites for which this sort of behavior is not
welcome, so crawl politely.
- Tim Roberts, timr at probo.com
Providenza & Boekelheide, Inc.