Check URL --> Simply?
Dr. David Mertz
mertz at gnosis.cx
Thu Aug 16 12:24:55 EDT 2001
David Eppstein <eppstein at ics.uci.edu> wrote:
|Humans can't usually see the http response line with the actual 404 number
|in it in place of the 200 indicating an ok page, but machines can -- why
|don't you use that?
A couple reasons. For one thing, my example was used to illustrate use
of some regular expressions. But as an answer to the thread, that's
pretty weak :-).
Let's try again: Because one might have previously crawled or proxied a
site, and want to figure out which downloaded pages are good (maybe not
using custom Python code for the crawling). Not too bad.
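For that already-downloaded case, about all you can do is grep the saved HTML for the telltale marks of an error page. A minimal sketch (the title patterns here are illustrative guesses, not a definitive list -- real sites vary widely):

```python
import re

# Titles that commonly mark a saved error page; purely illustrative.
ERROR_TITLE = re.compile(
    r'<title>[^<]*(404|not found|error)[^<]*</title>', re.IGNORECASE)

def looks_broken(html):
    """Guess whether saved HTML is an error page, judging by its <title>."""
    return bool(ERROR_TITLE.search(html))
```

So `looks_broken('<html><title>404 Not Found</title></html>')` is true, while an ordinary content page passes. Of course this only catches servers polite enough to label their error pages.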
But better still:
#---------- check_url.py ----------#
from httplib import HTTP
from urlparse import urlparse

def checkURL(url):
    p = urlparse(url)
    h = HTTP(p[1])               # p[1] is the host (netloc)
    h.putrequest('HEAD', p[2])   # HEAD gets the status line without the body
    h.endheaders()
    return h.getreply()          # (status, reason, headers)

if __name__ == '__main__':
    for url in ('http://msnbc.com/nonsense', 'http://msnbc.com/',
                'http://w3c.org/', 'http://w3c.org/nonsense',
                'http://w3c.org/Consortium/', 'http://ibm.com/',
                'http://ibm.com/nonsense'):
        print url, checkURL(url)[:2]
------------------------------------------------------------------------
% python check_url.py
http://msnbc.com/nonsense (200, 'OK')
http://msnbc.com/ (302, 'Object moved')
http://w3c.org/ (301, 'Moved Permanently')
http://w3c.org/nonsense (301, 'Moved Permanently')
http://w3c.org/Consortium/ (301, 'Moved Permanently')
http://ibm.com/ (200, 'OK')
http://ibm.com/nonsense (404, 'Not Found')
I tried a few sites to get these examples... but not all *that* many.
All the sites that end in 'nonsense' LOOK, to my human eyes, like broken
links... and all the others look like content (well, except msnbc.com,
which refuses to load--I think because I won't give it a cookie--and
wouldn't actually be other than nonsense if it would load :-)).
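The 301/302 rows above show why a bare HEAD check doesn't settle the question: a redirect has to be followed before you know whether the final target is good. A sketch of that, written against Python 3's http.client and urllib.parse (the modern successors to httplib and urlparse; the function names and the hop limit are my own choices):

```python
from http.client import HTTPConnection
from urllib.parse import urlparse

def classify(status):
    """Map an HTTP status code to a rough verdict."""
    if 200 <= status < 300:
        return 'ok'
    if 300 <= status < 400:
        return 'redirect'
    return 'broken'          # 4xx, 5xx, and anything stranger

def check_url(url, max_hops=5):
    """HEAD a URL, chasing redirects until a final status appears."""
    for _ in range(max_hops):
        p = urlparse(url)
        conn = HTTPConnection(p.netloc)
        conn.request('HEAD', p.path or '/')
        resp = conn.getresponse()
        conn.close()
        if classify(resp.status) != 'redirect':
            return resp.status, resp.reason
        url = resp.getheader('Location')   # follow to the next hop
    return resp.status, resp.reason        # gave up; still redirecting
```

With something like this, w3c.org/nonsense would resolve through its 301 to whatever the redirect target actually returns, instead of stopping at 'Moved Permanently'.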