Web page data and urllib2.urlopen

Piet van Oostrum piet at cs.uu.nl
Fri Aug 7 11:14:30 CEST 2009


>>>>> Dave Angel <davea at ieee.org> (DA) wrote:

>DA> Piet van Oostrum wrote:
>>> <snip>
>DA> If Mozilla had seen a page with this line in an appropriate place, it'd
>DA> immediately begin loading the other page, at "someotherurl".  But there's no
>DA> such line.
>>>> 
>>> 
>>> 
>DA> Next, I looked for javascript.  The Mozilla page contains lots of
>DA> javascript, but there's none in the raw page.  So I can't explain Mozilla's
>DA> differences that way.
>>>> 
>>> 
>>> 
>DA> I did notice the link to /m/Content/mobile2.css, but I don't know any way
>DA> a CSS file could cause the content to change, just the display.
>>>> 
>>> 
>>> 
>DA> All I can guess is that it has something to do with "browser type" or
>DA> cookies.  And that would make lots of sense if this was a cgi page.  But
>DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
>DA> another dozen special suffixes.
>>>> 
>>> 
>>> 
>DA> Any hints, anybody???
>>>> 
>>> 
>>> If you look into the HTML that Firefox gets, there is a lot of
>>> javascript in it.
>>> 

>DA> But the raw page didn't have any javascript.  So what about that original
>DA> raw page triggered additional stuff to be loaded?
>DA> Is it "user agent", as someone else brought out?  And is there somewhere I
>DA> can read more about that aspect of things?  I've mostly built very static
>DA> html pages, where the server yields the same page to everybody.  And some
>DA> form stuff, where the user clicks on a 'submit' button to trigger a script
>DA> that's not shown on the URL line.

Yes, if you send the User-Agent header of a 'normal' web browser, the
server does return the Javascript:

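import urllib2

# Identify as Firefox. urllib2's default User-Agent is "Python-urllib/x.y",
# which this server apparently answers with the script-free page.
url = 'http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27'
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')

# Fetch the page with the custom header and print the raw HTML.
opener = urllib2.build_opener()
page = opener.open(request).read()
print page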
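
For anyone reading this on Python 3, where urllib2 became
urllib.request, here is a minimal sketch of the same idea. It only
builds the request and shows which User-Agent would be sent, so it
runs without touching the network. The helper names
(default_user_agent, browser_request) and the example.com URL are my
own placeholders for illustration, not part of any library.

```python
import urllib.request

# Same Firefox identification string as in the urllib2 example.
BROWSER_UA = ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
              'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')


def default_user_agent():
    """Return the User-Agent an unmodified opener would send."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to every request.
    return dict(opener.addheaders).get('User-agent')


def browser_request(url, user_agent=BROWSER_UA):
    """Build (but do not send) a request that identifies as a browser."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


if __name__ == '__main__':
    # Placeholder URL, assumed for illustration only.
    req = browser_request('http://www.example.com/')
    print('default :', default_user_agent())
    # Request stores header names capitalized as 'User-agent'.
    print('override:', req.get_header('User-agent'))
```

Passing the request to urllib.request.urlopen(req) would then fetch
the page with the browser header. Whether a given server varies its
output on that header is up to the server, so comparing both
responses is the only way to be sure.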

-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org


