Check URL --> Simply?

Garth Grimm garth_grimm at hp.com
Wed Aug 15 14:00:32 EDT 2001


That's about the best solution possible.

Unfortunately, some dynamic sites trap what should be a 404 and instead
return a valid 200 along with a web page document.  Broadvision is one
notable culprit; it is often configured so that a request for a page that
doesn't exist is answered with the home page of the web site.

As far as a human user is concerned, such a link is dead, but that's next
to impossible to detect programmatically.  You end up having to write a
customized check for each site that behaves this way.

--
Garth


"Andy McKay" <andym at ActiveState.com> wrote in message
news:mailman.997892665.9674.python-list at python.org...
> Just check the headers; this means you don't get the document, saving
> bandwidth and solving the 404 file problem (assuming they set the right
> headers).
>
> from httplib import HTTP
> from urlparse import urlparse
>
> def checkURL(url):
>      p = urlparse(url)
>      h = HTTP(p[1])
>      h.putrequest('HEAD', p[2])
>      h.endheaders()
>      if h.getreply()[0] == 200: return 1
>      else: return 0
>
> if __name__ == '__main__':
>      assert checkURL('http://slashdot.org')
>      assert not checkURL('http://slashdot.org/notadirectory')
>
> --
>    Andy McKay
>
>
