Web page data and urllib2.urlopen
davea at ieee.org
Thu Aug 6 07:25:48 CEST 2009
> Hi everyone, I'm using the urllib2 library to get the html source code
> of web pages. In general it works great, but I'm dealing with a
> financial web site which does not provide the source code I expect. As
> a matter of fact if you try:
> import urllib2
> res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-
> page = res.read()
> print page
> you will see that the printed code is very different from the one
> given, for example, by mozilla. Since I have very little knowledge
> of html I can't even tell whether this is a python or html problem.
> Can anyone give me some help?
> Thanks in advance.
I don't think this is a Python issue, but rather a "raw read" versus an
interactive interpretation of a page. The browser does a lot more than
the single roundtrip performed by urlopen/read.
I also would love to see some explanation of what happens here, or a
pointer to a reference that would help me understand it.
I took the output of the read(), and formatted it, roughly, as html. I
expected to find a refresh, which is the simplest way that one page can
cause a very different one to be loaded.
<meta http-equiv="refresh" content="1;url=someotherurl" />
If Mozilla had seen a page with this line in an appropriate place, it'd
begin loading the other page, at "someotherurl", after a one-second
delay. But there's no such line, so a refresh doesn't account for
Mozilla's differences.
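For anyone who wants to repeat that check without reformatting the page by
hand, here is a minimal sketch that scans HTML text for a meta refresh tag.
The names RefreshFinder and find_refresh are made up for illustration, and
the import fallback is only there so the same code runs on Python 2 (as in
the original post) and on Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in the original post


class RefreshFinder(HTMLParser):
    """Collects the content attribute of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() == "refresh":
            self.targets.append(attrs.get("content"))


def find_refresh(html):
    """Return the content values of all meta refresh tags in html (hypothetical helper)."""
    finder = RefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.targets
```

Calling find_refresh(page) on the string returned by read() gives an empty
list when the page has no refresh tag, which matches what I saw by eye.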
I did notice the link to /m/Content/mobile2.css, but I don't know of any
way a CSS file could cause the content to change, just the display.
All I can guess is that it has something to do with "browser type" or
cookies. And that would make lots of sense if this were a CGI page. But
the URL doesn't look like that, as it doesn't end in pl, py, asp, or any
of a dozen other special suffixes.
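One way to test the "browser type" guess is to send a browser-like
User-Agent header and compare what comes back. This is only a sketch: I
have not confirmed that the site varies its output on this header. The
BROWSER_UA string and the build_request and fetch names are placeholders of
mine, and the import fallback lets the code run on Python 2 (urllib2) or
Python 3 (urllib.request).

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen          # Python 2, as in the original post

# Placeholder value; copy the real string from your own browser for a fair test.
BROWSER_UA = "Mozilla/5.0 (compatible; example)"


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that announces itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA):
    """Fetch url with the given User-Agent and return the raw body."""
    response = urlopen(build_request(url, user_agent))
    try:
        return response.read()
    finally:
        response.close()
```

If fetch(url) returns markup that differs from a plain urlopen(url).read()
for the same URL, the server is choosing content based on the header.
Cookies would need a separate experiment.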
Any hints, anybody???