HTML page into a string
Steve Holden
steve at holdenweb.com
Tue Feb 7 23:20:20 EST 2006
Tempo wrote:
> In my last post I received some advice to use urllib.read() to get a
> whole html page as a string, which will then allow me to use
> BeautifulSoup to do what I want with the string. But when I was
> researching the 'urllib' module I couldn't find anything about its
> sub-section '.read()' ? Is that the right module to get a html page
> into a string? Or am I completely missing something here? I'll take
> this as the more likely of the two cases. Thanks for any and all help.
>
I think you've misunderstood. There is no urllib.read(); you call
urllib.urlopen() with a URL as an argument, and the object that call
returns is file-like (insofar as you can call its read() method to get
the content of the web page as a string):
>>> import urllib
>>> page = urllib.urlopen("http://www.holdenweb.com/")
>>> data = page.read()
>>> print data
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
<meta name="generator" content="Adobe GoLive 6">
<meta http-equiv="DESCRIPTION" content="Holden Web provides
architectural design of databases and information systems, with
full-service implementation and support">
...
</tr>
</tbody>
</table>
</div>
</body>
</html>
>>>
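The session above is Python 2. As a rough sketch of the same idea under
Python 3 (an assumption, since the original post predates it), urlopen()
lives in urllib.request and read() returns bytes, so the result has to be
decoded before it is a string. The helper name fetch_html and its
fallback_encoding parameter are made up for illustration:

```python
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the document at `url` as a str (hypothetical helper)."""
    # urlopen() returns a file-like response object; read() yields bytes.
    with urlopen(url) as page:
        raw = page.read()
        # Prefer the charset named in the Content-Type header, if there is one.
        encoding = page.headers.get_content_charset() or fallback_encoding
    # Replace undecodable bytes rather than raising on a mislabelled page.
    return raw.decode(encoding, errors="replace")
```

Something like fetch_html("http://www.holdenweb.com/") should then give a
string that can be handed to BeautifulSoup.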
You will find there are lots of other things you can do with that
file-like object too, but reading it is the important one as far as
using BeautifulSoup goes.
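
A few of those other things, sketched under the assumption of Python 3's
urllib.request (the function name describe_page is hypothetical): the
response object can report the URL it ended up at, expose the response
headers, and be read a line at a time like any other file.

```python
from urllib.request import urlopen


def describe_page(url):
    """Collect a few facts about the response (hypothetical helper)."""
    with urlopen(url) as page:
        return {
            # The URL actually retrieved, which may differ after redirects.
            "final_url": page.geturl(),
            # The media type from the Content-Type header, e.g. "text/html".
            "content_type": page.headers.get_content_type(),
            # File-like reading: just the first line, as bytes.
            "first_line": page.readline(),
        }
```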
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/