getting infos from a website

Quinn Dunkan quinn at hurl.ugcs.caltech.edu
Sun Mar 31 16:08:06 EST 2002


On Sat, 30 Mar 2002 16:00:23 -0500, Zutroi Zatatakowski <abou at cam.org> wrote:
>But another thing... Now that I can capture a website html and output it
>into a file, I have to remove all html tags (I guess replacing '<>' by '
>') or, but I don't know if it's possible, instead of capturing the HTML
>source of the page, could I retrieve only the text, like basic ASCII
>copy/paste?   

Depends on what you want to do with it.  If you just want to read it,
download w3m, save the page source to a tmp file and capture the output of
'w3m -dump tmpfile', which renders the page as plain text.
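
Something like the following is a rough sketch of the w3m route, not a
tested recipe.  It assumes Python 3 and that w3m is installed and on your
PATH.  The helper name dump_text and its cmd parameter are made up for
illustration, and '-T text/html' is added on the assumption that you want
w3m to treat the input as HTML regardless of the file name.

```python
import os
import subprocess
import tempfile


def dump_text(html, cmd=("w3m", "-dump", "-T", "text/html")):
    """Write html to a temporary file, run cmd on it, return its stdout.

    dump_text and cmd are hypothetical names; cmd defaults to w3m, which
    must be installed separately.
    """
    # delete=False so the file survives the 'with' block and the external
    # program can open it by name; it is removed explicitly below.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(html)
        path = tmp.name
    try:
        result = subprocess.run(
            list(cmd) + [path],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if the command fails
        )
        return result.stdout
    finally:
        os.unlink(path)
```

Calling dump_text(page_source) with the HTML you already captured should
hand back roughly what you would get from copy/paste in a browser.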

If you want to extract some part of the page you could look for a landmark
(some fixed string near the data you want) and write a simple regexp.
Otherwise there are some HTML parsing bits in the stdlib (htmllib, sgmllib,
HTMLParser).  Or you could pick apart the w3m output.  How fragile your
code is to page changes depends on the page and how you do things.
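
To make the two pure-Python options concrete, here is a minimal sketch
assuming Python 3, where the stdlib parser lives in html.parser.  The names
TextExtractor, html_to_text and find_after_landmark are invented for this
example, and the regexp is only one guess at what "simple" might look like
for your page.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of html, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "\n".join(parser.parts)


def find_after_landmark(html, landmark):
    """Return the first run of text after landmark, skipping tags, or None."""
    pattern = re.escape(landmark) + r"\s*(?:<[^>]*>\s*)*([^<]+)"
    match = re.search(pattern, html)
    return match.group(1).strip() if match else None
```

The parser version keeps working if the markup around the text changes,
while the regexp version only needs the landmark to stay put, so which one
is less fragile depends on what the site tends to change.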


