Check URL --> Simply?
Dr. David Mertz
mertz at gnosis.cx
Thu Aug 16 12:24:55 EDT 2001
David Eppstein <eppstein at ics.uci.edu> wrote:
|Humans can't usually see the http response line with the actual 404 number
|in it in place of the 200 indicating an ok page, but machines can -- why
|don't you use that?
A couple reasons. For one thing, my example was used to illustrate use
of some regular expressions. But as an answer to the thread, that's
pretty weak :-).
Let's try again: Because one might have previously crawled or proxied a
site, and want to figure out which downloaded pages are good (maybe not
using custom Python code for the crawling). Not too bad.
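For that already-downloaded case, about all you can do is grep the saved HTML for the telltale marks of an error page. A minimal sketch (the title patterns here are illustrative guesses, not a definitive list -- real sites vary widely):

```python
import re

# Titles that commonly mark a saved error page; purely illustrative.
ERROR_TITLE = re.compile(
    r'<title>[^<]*(404|not found|error)[^<]*</title>', re.IGNORECASE)

def looks_broken(html):
    """Guess whether saved HTML is an error page, judging by its <title>."""
    return bool(ERROR_TITLE.search(html))
```

So `looks_broken('<html><title>404 Not Found</title></html>')` is true, while an ordinary content page passes. Of course this only catches servers polite enough to label their error pages.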
But better still:
#---------- check_url.py ----------#
from httplib import HTTP
from urlparse import urlparse

def checkURL(url):
    p = urlparse(url)
    h = HTTP(p[1])               # p[1] is the host (netloc)
    h.putrequest('HEAD', p[2])   # HEAD gets the status line without the body
    h.endheaders()
    return h.getreply()          # (status, reason, headers)

if __name__ == '__main__':
    for url in ('http://msnbc.com/nonsense', 'http://msnbc.com/',
                'http://w3c.org/', 'http://w3c.org/nonsense',
                'http://w3c.org/Consortium/', 'http://ibm.com/',
                'http://ibm.com/nonsense'):
        print url, checkURL(url)[:2]
------------------------------------------------------------------------
% python check_url.py
http://msnbc.com/nonsense (200, 'OK')
http://msnbc.com/ (302, 'Object moved')
http://w3c.org/ (301, 'Moved Permanently')
http://w3c.org/nonsense (301, 'Moved Permanently')
http://w3c.org/Consortium/ (301, 'Moved Permanently')
http://ibm.com/ (200, 'OK')
http://ibm.com/nonsense (404, 'Not Found')
I tried a few sites to get these examples... but not all *that* many.
All the sites that end in 'nonsense' LOOK, to my human eyes, like broken
links... and all the others look like content (well, except msnbc.com,
which refuses to load--I think because I won't give it a cookie--and
wouldn't actually be other than nonsense if it would load :-)).
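The 301/302 rows above show why a bare HEAD check doesn't settle the question: a redirect has to be followed before you know whether the final target is good. A sketch of that, written against Python 3's http.client and urllib.parse (the modern successors to httplib and urlparse; the function names and the hop limit are my own choices):

```python
from http.client import HTTPConnection
from urllib.parse import urlparse

def classify(status):
    """Map an HTTP status code to a rough verdict."""
    if 200 <= status < 300:
        return 'ok'
    if 300 <= status < 400:
        return 'redirect'
    return 'broken'          # 4xx, 5xx, and anything stranger

def check_url(url, max_hops=5):
    """HEAD a URL, chasing redirects until a final status appears."""
    for _ in range(max_hops):
        p = urlparse(url)
        conn = HTTPConnection(p.netloc)
        conn.request('HEAD', p.path or '/')
        resp = conn.getresponse()
        conn.close()
        if classify(resp.status) != 'redirect':
            return resp.status, resp.reason
        url = resp.getheader('Location')   # follow to the next hop
    return resp.status, resp.reason        # gave up; still redirecting
```

With something like this, w3c.org/nonsense would resolve through its 301 to whatever the redirect target actually returns, instead of stopping at 'Moved Permanently'.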