Web page data and urllib2.urlopen

Dave Angel davea at ieee.org
Thu Aug 6 07:25:48 CEST 2009

Massi wrote:
> Hi everyone, I'm using the urllib2 library to get the html source code
> of web pages. In general it works great, but I'm dealing with a
> financial web site which does not provide the source code I expect. As
> a matter of fact if you try:
> import urllib2
> res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
> page = res.read()
> print page
> you will see that the printed code is very different from the one
> given, for example, by Mozilla. Since I have very little knowledge
> of HTML I can't even tell whether this is a Python or an HTML problem.
> Can anyone give me some help?
> Thanks in advance.

I don't think this is a Python issue, but a difference between a "raw 
read" and a browser's interactive interpretation of a page.  The browser 
does a lot more than the single round trip performed by urlopen/read.

I also would love to see some explanation of what happens here, or a 
pointer to a reference that would help me understand it.

I took the output of the read(), and formatted it, roughly, as html.  I 
expected to find a refresh, which is the simplest way that one page can 
cause a very different one to be loaded.
      <meta http-equiv="refresh" content="1;url=someotherurl" />

If Mozilla had seen a page with this line in an appropriate place, it'd 
immediately begin loading the other page, at "someotherurl".  But there's 
no such line.
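
For anyone who wants to repeat that check without reading the markup by 
eye, here is a rough sketch.  It is written for Python 3 (where the 
urllib2 functionality lives in urllib.request) and uses only the standard 
html.parser module.  The names MetaRefreshFinder and find_meta_refresh are 
made up for illustration, and the function expects the page text that 
read() returned, already decoded to a string.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collect the content attribute of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        # Self-closing tags (<meta ... />) are also routed through here.
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The attribute value may be None if it was written without "=...".
        if (attributes.get("http-equiv") or "").lower() == "refresh":
            self.refreshes.append(attributes.get("content"))


def find_meta_refresh(html_text):
    """Return a list of refresh directives found in html_text (hypothetical helper)."""
    finder = MetaRefreshFinder()
    finder.feed(html_text)
    return finder.refreshes
```

An empty list from find_meta_refresh(page_text) would match what I saw: 
no refresh directive in the raw page.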

Next, I looked for javascript.  The Mozilla page contains lots of 
javascript, but there's none in the raw page.  So I can't explain 
Mozilla's differences that way.
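
The javascript check can be sketched the same way.  This is only an 
illustration for Python 3 using the standard html.parser module; the 
names ScriptCounter and summarize_scripts are hypothetical, and it only 
looks at <script> tags, not at inline event handlers such as onload 
attributes.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Record external script URLs and count inline <script> blocks."""

    def __init__(self):
        super().__init__()
        self.external = []      # values of src="..." on <script> tags
        self.inline_count = 0   # <script> tags that carry no src attribute

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external.append(src)
        else:
            self.inline_count += 1


def summarize_scripts(html_text):
    """Return (external script URLs, number of inline scripts) for html_text."""
    counter = ScriptCounter()
    counter.feed(html_text)
    return counter.external, counter.inline_count
```

Running summarize_scripts on both the raw read() output and the source 
saved from the browser would make the difference between the two pages 
easy to see.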

I did notice the link to /m/Content/mobile2.css, but I don't know any 
way a CSS file could cause the content to change, just the display.

All I can guess is that it has something to do with "browser type" or 
cookies.  And that would make lots of sense if this were a CGI page.  But 
the URL doesn't look like that, as it doesn't end in pl, py, asp, or any 
of a dozen other special suffixes.
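
One way to test the "browser type" and cookie guess is to send a 
browser-like User-Agent header and keep cookies between requests.  Note 
that a server can generate a page dynamically whatever the URL's suffix 
looks like, so the guess may still hold.  The sketch below is for Python 3 
(urllib.request in place of urllib2).  The User-Agent string is just a 
placeholder I picked, the helper names are hypothetical, and whether this 
particular site really varies its output by User-Agent is an assumption I 
have not verified.

```python
import urllib.request
from http.cookiejar import CookieJar

# Placeholder browser-like identification string (an assumption, not a
# value known to be required by any particular site).
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_browser_like_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a Request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch(url):
    """Fetch url with the browser-like header and cookie handling; return bytes."""
    opener, _jar = build_cookie_opener()
    with opener.open(build_browser_like_request(url), timeout=30) as response:
        return response.read()
```

Calling fetch() on the story URL and comparing the result with the plain 
urlopen() output would show whether the header or the cookies make any 
difference.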

Any hints, anybody???

