scraping a web page

Richard Jones richard at bizarsoftware.com.au
Mon Sep 10 22:12:23 EDT 2001


On Tuesday 11 September 2001 11:58, Tom Harris wrote:
> I used to have a script that would automatically renew my library books by
> talking to the library catalogue via telnet. I tried to use it again and I
> find that they have removed the telnet option, and just left the CGI web
> page. What is the suggested Pythonic solution to retrieving information
> from HTTP pages? I could use a regex, but this would take some care to be
> less than extremely fragile, and would probably break every time a small
> change was made to the page.

Use HTMLParser from the standard library's htmllib module. When you subclass
it you can define do_foo() methods, where "foo" is the name of the tag you
wish to handle. For example, do_img(self, attributes) will be called for
every <img> tag in the source, with the tag's attributes passed in as a list
of 2-tuples.
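
As a rough sketch of the idea, here is how it might look on a current Python.
Note the assumptions: htmllib no longer exists in Python 3, so this uses
html.parser.HTMLParser instead and re-creates the do_foo() dispatch by hand
inside handle_starttag(). The class names TagDispatchParser and
ImageCollector and the sample page are made up for illustration; they are not
part of any library or of the library catalogue in question.

```python
from html.parser import HTMLParser


class TagDispatchParser(HTMLParser):
    """Hypothetical helper: forward each start tag to do_<tag>(), if defined.

    html.parser.HTMLParser has no built-in do_foo() hooks, so this emulates
    that style on top of handle_starttag().
    """

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) 2-tuples.
        handler = getattr(self, "do_" + tag, None)
        if handler is not None:
            handler(attrs)


class ImageCollector(TagDispatchParser):
    """Collect the src attribute of every <img> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def do_img(self, attributes):
        for name, value in attributes:
            if name == "src":
                self.sources.append(value)


# Made-up sample markup standing in for a fetched page.
page = (
    '<html><body><img src="cover.png" alt="Cover">'
    '<p>Due: <b>1 Oct</b></p><img src="logo.gif"></body></html>'
)

collector = ImageCollector()
collector.feed(page)
collector.close()
print(collector.sources)  # ['cover.png', 'logo.gif']
```

Because the parser only reacts to the tags you name, unrelated changes to the
page layout are less likely to break it than a regex over the raw text.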


    Richard



