Web page data and urllib2.urlopen

Piet van Oostrum piet at cs.uu.nl
Fri Aug 7 14:54:31 CEST 2009

>>>>> Dave Angel <davea at ieee.org> (DA) wrote:

>DA> Piet van Oostrum wrote:
>>>>>>>> <snip>
>>> <snip>
>DA> But the raw page didn't have any javascript.  So what about that original
>DA> raw page triggered additional stuff to be loaded?
>DA> Is it "user agent", as someone else brought out?  And is there somewhere I
>DA> can read more about that aspect of things?  I've mostly built very static
>DA> html pages, where the server yields the same page to everybody.  And some
>DA> form stuff, where the user clicks on a 'submit' button to trigger a script
>DA> that's not shown on the URL line.
>>> Yes, if you specify a 'normal' web browser as user agent you do get the
>>> Javascript:
>>> import urllib2
>>> request = urllib2.Request('http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27')
>>> request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv: Gecko/2009073021 Firefox/3.0.13')
>>> opener = urllib2.build_opener()
>>> page = opener.open(request).read()
>>> print page
>DA> Thanks much.  That's a key I didn't understand.

You can even specify the headers in the Request constructor:

url = 'http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27'
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv: Gecko/2009073021 Firefox/3.0.13'}
request = urllib2.Request(url = url, headers = hdr)
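
As a rough check, here is a sketch that builds the request both ways and compares the stored header without touching the network. The names ua, req_a, req_b, same_header and default_agent are only for illustration, and the try/except import is my assumption so that the same lines also run on Python 3, where these classes live in urllib.request. The last line looks at what a plain opener sends when you set nothing, which is likely why the server treated the original script differently from a browser.

```python
try:
    import urllib2                      # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 home of the same classes

url = 'http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27'
ua = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv: Gecko/2009073021 Firefox/3.0.13'

# Variant 1: header added after construction.
req_a = urllib2.Request(url)
req_a.add_header('User-Agent', ua)

# Variant 2: header passed to the constructor.
req_b = urllib2.Request(url, headers={'User-Agent': ua})

# Request stores header names capitalized, so the lookup key is 'User-agent'.
same_header = req_a.get_header('User-agent') == req_b.get_header('User-agent') == ua

# The User-Agent a plain opener sends when the request does not set one.
default_agent = dict(urllib2.build_opener().addheaders)['User-agent']
```

Here default_agent should start with 'Python-urllib/', followed by a version number that depends on your Python.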

Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org
