Check URL --> Simply?

Wed Aug 15 11:47:14 EDT 2001

"JS" <joesalt at ireland.com> wrote in message
news:d9deee30.0108150633.b8433f7 at posting.google.com...
> One question regarding handling directories from the example below...
>
> >>> checkURL('http://www.slashdot.org')
> 1
> >>> checkURL('http://www.slashdot.org/notadirectory')
> 1
>
> In the second example, the directory doesn't exist, yet I am returned
> 1? Is there a way to handle this?

Only a very ad-hoc one, I fear...

See:

>>> import urllib
>>> x=urllib.urlopen('http://www.slashdot.org/notadirectory')
>>> x
<addinfourl at 8383084 whose fp = <socket._fileobject instance at 007FE05C>>
>>> xx=x.read()
>>> print xx[:300]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML><HEAD><TITLE>404 File Not Found</TITLE>
 </HEAD>
<BODY bgcolor="#000000" text="#000000" link="#006666" vlink="#000000">
<CENTER>
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0><TR><TD WIDTH=1><SCRIPT
LANGUAGE="JAVASCRIPT">
<!--
now = new Dat
>>>

So, slashdot doesn't appear to give an error when I try to GET that
URL -- it returns what looks like a perfectly valid page.  Only a human
being can see that the title suggests it's not a real page, but one
synthesized on the fly to give a "friendly" error message.
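
Before scanning page bodies, it may be worth looking at the HTTP status
code, since a server can send a 404 status together with a friendly
body.  What follows is only a minimal sketch, assuming Python 3's
urllib.request rather than the urllib module used in the session above;
http_status is a hypothetical helper name, not something from this
thread.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no reply arrived.

    http_status is a hypothetical helper written for illustration.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx replies are raised as HTTPError; the code is kept.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None
```

A site that answers 200 along with its "not found" page will still come
back as 200 here, so this helps only when the server sets the status
properly.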

You may pepper your checking code with a zillion special cases
to try and identify the various "friendly error message pages"
returned by sites you're interested in -- and one day, of course,
your program will end up reporting as "not found" a perfectly
valid URL to a document that happens to have a title such as
"404 File Not Found" or something like that.
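
To make the special-casing concrete, here is a rough sketch of one such
check, written for Python 3.  The helper name looks_like_error_page and
the title patterns are assumptions made up for illustration; real sites
word their error pages in many other ways.

```python
import re

# Hypothetical patterns; each site words its error pages differently.
ERROR_TITLE = re.compile(r"\b(404|not found)\b", re.IGNORECASE)
TITLE_TAG = re.compile(r"<title[^>]*>(.*?)</title>",
                       re.IGNORECASE | re.DOTALL)


def looks_like_error_page(html):
    """Guess from the <title> whether html is a 'friendly' error page."""
    match = TITLE_TAG.search(html)
    if match is None:
        # No title at all: nothing to judge by, so assume a real page.
        return False
    return ERROR_TITLE.search(match.group(1)) is not None
```

Fed the page shown earlier, whose title is "404 File Not Found", this
returns a true value.  It does the same for a genuine article titled,
say, 'Fixing "404 File Not Found" errors', which is exactly the kind of
misfire described above.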

That's "convenience" and "friendliness" for you -- the new keywords
in whose name software designers now perpetrate the worst horrors,
although that old champ of this field, "performance", isn't too far
behind even now.

Alex