What is the best way to "get" a web page?

Pete harbingerofpeace at post.com
Sun Sep 24 18:26:53 CEST 2006


> > The file "temp.html" is definitely different than the first run, but
> > still not anything close to www.python.org . Any other suggestions?
>
> If you mean that the page looks different in a browser, for one thing
> you have to download the css files too. Here's the relevant extract
> from the main page:
>
> <link media="screen" href="styles/screen-switcher-default.css"
> type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
> <link media="scReen" href="styles/netscape4.css" type="text/css"
> rel="stylesheet" />
> <link media="print" href="styles/print.css" type="text/css"
> rel="stylesheet" />
> <link media="screen" href="styles/largestyles.css" type="text/css"
> rel="alternate stylesheet" title="large text" />
> <link media="screen" href="styles/defaultfonts.css" type="text/css"
> rel="alternate stylesheet" title="default fonts" />
>
> You may either hardcode the urls of the css files, or parse the page,
> extract the css links and normalize them to absolute urls. The first is
> simpler but the second is more robust, in case a new css is added or an
> existing one is renamed or removed.
>
> George
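
The second approach George describes (parse the page, pull out the css links, make them absolute) can be sketched with the standard library alone. This is a minimal sketch for Python 3, not code from the thread: the names StylesheetLinkParser and stylesheet_urls are made up for illustration, and it assumes the base URL of the page is known to the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetLinkParser(HTMLParser):
    """Collect absolute URLs of <link> tags whose rel includes "stylesheet".

    Hypothetical helper, not part of any library.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here.
        if tag != "link":
            return
        attr_map = dict(attrs)
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "stylesheet" in rel_tokens and href:
            # Resolve relative hrefs against the page URL.
            self.stylesheets.append(urljoin(self.base_url, href))


def stylesheet_urls(html_text, base_url):
    """Return the absolute stylesheet URLs found in html_text, in page order."""
    parser = StylesheetLinkParser(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.stylesheets
```

Note that rel="alternate stylesheet" links are collected too, since their rel value contains the stylesheet token; filter them out if only the default styles are wanted.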

Thanks for the information on CSS. I'll look into that later, but for
now my question is about the first two lines of the HTML. Here's my
latest Python code:

>>> import urllib
>>> web_page = urllib.urlopen("http://www.python.org")
>>> fileTemp = open("temp.html", "w")
>>> web_page_contents = web_page.read()
>>> fileTemp.write(web_page_contents)
>>> fileTemp.close()
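
For readers on Python 3, where urllib.urlopen moved to urllib.request.urlopen, a rough equivalent of the session above might look like the following. It is only a sketch: the function name save_page is hypothetical, and writing in binary mode is a choice made here so that the saved bytes match what the server sent.

```python
from urllib.request import urlopen


def save_page(url, path):
    """Download url and write the raw response bytes to path.

    Returns the number of bytes written.
    """
    with urlopen(url) as response:
        contents = response.read()
    # Binary mode keeps the bytes exactly as received, so no newline
    # or encoding translation happens on the way to disk.
    with open(path, "wb") as out_file:
        out_file.write(contents)
    return len(contents)
```

Calling save_page("http://www.python.org", "temp.html") would then play the role of the six interactive lines above, with both the connection and the file closed automatically.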

Here are the first two lines of temp.html:

      1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      2 <html lang="en" xml:lang="en"
xmlns="http://www.w3.org/1999/xhtml">

Here are the first two lines of www.python.org as saved from Firefox:

      1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      2 <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"
lang="en"><head>

Line 1 is identical in both files, but line 2 is not. Why would line 2
differ? Hmmmm...

Thanks,
Pete
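
The two versions of line 2 quoted above carry the same three attributes and differ only in attribute order and in where the line breaks. A quick way to check that is to parse both and compare the attributes as dictionaries, which ignore order. This is a sketch under assumptions: FirstTagAttributes and tag_attributes are hypothetical names, and the two strings are copied from the message. It does not show why the order differs; one plausible cause, not established in this thread, is that the browser writes out its parsed document rather than the raw bytes it received.

```python
from html.parser import HTMLParser


class FirstTagAttributes(HTMLParser):
    """Record the attributes of the first start tag with a given name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name and self.attributes is None:
            self.attributes = dict(attrs)


def tag_attributes(markup, tag_name="html"):
    """Return the first matching tag's attributes as a dict, or None."""
    parser = FirstTagAttributes(tag_name)
    parser.feed(markup)
    parser.close()
    return parser.attributes


# Line 2 as fetched with urllib, and as saved from Firefox.
fetched = '<html lang="en" xml:lang="en"\nxmlns="http://www.w3.org/1999/xhtml">'
saved = '<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"\nlang="en"><head>'

# Dict comparison ignores attribute order and source whitespace.
same_attributes = tag_attributes(fetched) == tag_attributes(saved)
```

If same_attributes is true, the two files describe the same html element, and a byte-for-byte comparison of the two files is stricter than the question really needs.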



