Urllib's urlopen and urlretrieve
Dave Angel
davea at davea.name
Thu Feb 21 13:53:25 EST 2013
On 02/21/2013 07:12 AM, qoresucks at gmail.com wrote:
>
> <snip>
> Why is it that when using urllib.urlopen then reading, or urllib.urlretrieve, does it only give me parts of the sites, losing the formatting, images, etc...? How can I get around this?
>
Start by telling us whether you're using Python 2 or Python 3, as this library
is different between versions. Also tell us your OS, as there are lots of
useful utilities in Unix, and a different set in Windows or other
places. Even if the same program exists on both, it's likely to be
named differently.
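For what it's worth, here's a minimal sketch of fetching a single page in
either version; about the only difference at this level is where urlopen
lives. I'm using python.org as a stand-in for whatever site you actually mean:

try:
    from urllib2 import urlopen           # Python 2
except ImportError:
    from urllib.request import urlopen    # Python 3

html = urlopen("http://www.python.org/").read()
print("fetched %d bytes" % len(html))      # bytes in Python 3, str in Python 2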
My earlier reply assumed you were trying to get an accurate copy of your
own website, presumably because your local copy had gotten out of sync.
rh assumed differently, so I'll try again. If you're trying to
download someone else's site, you should realize that you may be violating
copyright, and ought to get permission. It's one thing to extract a
file or two, but another entirely to try to capture the entire site.
Many sites consider all of the details proprietary; others consider
the images proprietary, and enforce the individual copyrights.
You can indeed copy individual files with urllib or urllib2, but that's
just the start of the problem. A typical web page is written in html
(or xhtml, or ...), and displaying it is the job of a browser, not the
cat command. In addition, the page will generally refer to lots of
other files, the most common being a css file and a few jpegs. So
you have to parse the page to find all those dependencies, and copy them
as well.
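To make that concrete, here's a rough sketch that lists the extra files one
page pulls in, using only the standard library. I'm assuming Python 2 here;
in Python 3 the imports move to urllib.request and html.parser. Real pages
reference dependencies in more ways than these three tags, so treat it as a
starting point, not a site copier:

import urllib2
from HTMLParser import HTMLParser

class DependencyFinder(HTMLParser):
    """Collect the urls of images, stylesheets and scripts a page refers to."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.deps = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'img' and 'src' in attrs:
            self.deps.append(attrs['src'])
        elif tag == 'link' and attrs.get('rel') == 'stylesheet':
            self.deps.append(attrs.get('href'))
        elif tag == 'script' and 'src' in attrs:
            self.deps.append(attrs['src'])

finder = DependencyFinder()
finder.feed(urllib2.urlopen("http://www.python.org/").read())
for dep in finder.deps:
    print dep    # mostly relative urls; urlparse.urljoin() can make them absolute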
Next, the page may contain code (e.g. PHP, JavaScript), or it may be code
(e.g. Python or Perl). In each of those cases, what you'll get isn't
exactly what you'd expect. If you try to fetch a Python program,
generally what happens is that it gets run on the server, and you fetch its
stdout instead. JavaScript, on the other hand, gets executed by the browser,
while PHP is executed on the server before the page ever reaches you.
Finally, the page may make use of resources which simply won't be visible
to you without becoming a hacker. As with my rsync and scp examples, you'll
probably need a userid and password to get into the guts.
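If the protected parts happen to use plain HTTP basic authentication, urllib2
can supply the userid and password for you; many sites use login forms and
cookies instead, which takes more machinery (cookielib and friends). A sketch,
with a made-up url and credentials:

import urllib2

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://example.com/", "myuser", "mypassword")
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
print opener.open("http://example.com/private/").read()[:200]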
If you want to play with some of this without programming, you could go
to your favorite browser, and View->Source. The method of doing that
varies with browser brand, version & OS, but it should be there on some
menu someplace. In Chrome, it's Tools->View Source.
The examples below are extracted from the main page at python.org:
<title>Python Programming Language – Official Website</title>
That simply sets the title for the page. It's not even part of the
body; it's part of the header for the page. In this case, the header
continues for 77 lines, including meta tags, javascript stuff, css
stuff, etc.
You might observe that angle brackets are used to enclose specific kinds
of data. In the above example, it's a "title" element, and it's
enclosed by <title> and </title>.
In xhtml, these tags always come in pairs, like curly braces in C
programming. However, most web pages are busted in one way or another,
so parsing them is sometimes troublesome. Most people seem to recommend
Beautiful Soup, in part because it tolerates many kinds of errors.
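Here's a minimal Beautiful Soup sketch, again assuming Python 2 plus the
third-party beautifulsoup4 package:

import urllib2
from bs4 import BeautifulSoup      # pip install beautifulsoup4

soup = BeautifulSoup(urllib2.urlopen("http://www.python.org/").read())
print soup.title.string            # the text of the <title> element
for anchor in soup.find_all('a'):
    print anchor.get('href')       # every link on the page, busted markup or not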
I'd get a good book on html programming, making sure it covers xhtml and
css. But I don't know what to recommend, as everything in my arsenal is
thoroughly dated.
Much of the body is devoted to the complexity of laying out the page in
a browser of variable size, with varying fonts, user overrides, etc. The
following excerpt:
> <div style="align:center; padding-top: 0.5em; padding-left: 1em">
> <a href="/psf/donations/"><img width="116" height="42"
> src="/images/donate.png" alt="" title="" /></a>
> </div>
The whole thing is a "div" or division. It's an individual chunk of the
page that might be placed almost anywhere within a bigger div or the
page itself. It has a style attribute, which gives hints to the browser
about how it wants to be displayed. More commonly, the style will be
specified indirectly through a separate css file.
It contains an "a" tag, which makes a link. The link may be underlined, but
the css or the browser may override that. The url the link points to is
specified in the href attribute. The "a" element in turn encloses an "img"
tag, which describes a png image file to be displayed (its src attribute)
and specifies the scaling for it via width and height; the title attribute
supplies a tooltip and alt supplies alternate text for when the image can't
be shown.
> <h4><a href="/about/help/">Help</a></h4>
The h4 tag picks up css rules which specify various things about how
it'll display. It's usually used for making larger and smaller
versions of text for titles and such.
> <link rel="stylesheet" type="text/css" media="screen"
> id="screen-switcher-stylesheet"
> href="/styles/screen-switcher-default.css" />
This points to a css file, which in turn refers to another one, called
styles.css. That's where you can see the style definitions for h4:
> H1,H2,H3,H4,H5 {
> font-family: Georgia, "Bitstream Vera Serif",
> "New York", Palatino, serif;
> font-weight:normal;
> line-height: 1em;
> }
This defines the common attributes for all the Hn series. Then they are
refined and overridden by:
> H4
> {
> font-size: 125%;
> color: #366D9C;
> margin: 0.4em 0 0.0em 0;
> }
So we see that H4 is 25% bigger than default. Similarly H3 is 35%, and
H2 is 40% bigger.
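To put numbers on that, assuming the usual browser default of 16px: H4 comes
out at 16 * 1.25 = 20px, H3 at 16 * 1.35 = 21.6px, and H2 at 16 * 1.40 =
22.4px, unless the user has overridden the default.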
It's a very complicated topic, and I wish you luck on it. But it's not
clear that the first step should involve any Python programming. I got
all the above just with Chrome in its default setup. I haven't even
mentioned things like the Tools->DeveloperTools, or other stuff you
could get via plugins.
If you're copying these files with a view to running them locally,
realize that for most websites you'd need lots of installed software
to act as a webserver. If you're writing your own, you can start simple,
and maybe never need any of the extra tools. For example, on my own
website I only needed static pages, so the Python code I wrote just
generates the web pages, which are then uploaded as-is to the site.
They can be tested locally simply by making up a url which starts
file://
instead of
http://
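Here's a toy illustration of that workflow; the file name and contents are
made up, and the file:// construction below is the Unix form (Windows paths
need a little more massaging):

import os, webbrowser

# Generate a trivial static page...
with open("index.html", "w") as f:
    f.write("<html><head><title>Test</title></head>"
            "<body><h4>Hello from a generated page</h4></body></html>")

# ...and view it locally with a file:// url instead of http://
webbrowser.open("file://" + os.path.abspath("index.html"))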
But as soon as I want database features, or counters, or user accounts,
or data entry, or randomness, I might add code that runs on the server,
and that's a lot trickier. Probably someone who has done it can tell us
I'm all wet, though.
--
DaveA