Check URL --> Simply?

Alex Martelli aleaxit at yahoo.com
Wed Aug 15 11:47:14 EDT 2001


"JS" <joesalt at ireland.com> wrote in message
news:d9deee30.0108150633.b8433f7 at posting.google.com...
> One question regarding handling directories from the example below...
>
> >>> checkURL('http://www.slashdot.org')
> 1
> >>> checkURL('http://www.slashdot.org/notadirectory')
> 1
>
> In the second example, the directory doesn't exist, yet I am returned
> 1? Is there a way to handle this?

Only a very ad-hoc one, I fear...

See:

>>> import urllib
>>> x=urllib.urlopen('http://www.slashdot.org/notadirectory')
>>> x
<addinfourl at 8383084 whose fp = <socket._fileobject instance at 007FE05C>>
>>> xx=x.read()
>>> print xx[:300]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML><HEAD><TITLE>404 File Not Found</TITLE>
 </HEAD>
<BODY bgcolor="#000000" text="#000000" link="#006666" vlink="#000000">
<CENTER>
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0><TR><TD WIDTH=1><SCRIPT LANGUAGE="JAVASCRIPT">
<!--
now = new Dat
>>>

So, slashdot doesn't give an error when I try to GET that URL -- it
appears to give a perfectly valid page.  Only, a human being can see
that the title suggests it's not a real page, but one synthesized on
the fly to give a "friendly" error message.
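
The session above shows only the body of the reply, not the HTTP status
that came with it.  If the server does send a 404 status alongside its
friendly page (an assumption; the output above cannot confirm it), the
status is a more reliable signal than the body.  A minimal sketch in
Python 3, where urllib.request.urlopen raises HTTPError for such
replies; check_url is a hypothetical helper name, not the checkURL from
the original question:

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10):
    """Return the HTTP status code for url, or None if no reply arrived.

    Hypothetical helper: it trusts the status code only, so a server
    that answers 200 with a "friendly" error page still looks fine here.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except urllib.error.URLError:
        # No usable reply at all: DNS failure, refused connection,
        # unsupported URL scheme, and so on.
        return None
```

A caller would then treat 200 as "exists" and 404 as "missing", which
helps only with servers that report the status honestly.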

You may pepper your checking code with a zillion special cases
to try and identify the various "friendly error message pages"
returned by sites you're interested in -- and one day, of course,
your program will end up treating as "not found" a perfectly
valid URL to a document that happens to have a title such as
"404 File Not Found" or something like that.
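
For illustration, one such special case might look like the sketch
below.  The function name and the list of hint phrases are
hypothetical; each site would need its own, and the list can never be
complete.

```python
import re

# Hypothetical hint phrases, matched case-insensitively against the
# page title.  Extend per site; this can never be exhaustive.
ERROR_TITLE_HINTS = ("404", "not found")

_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>",
                       re.IGNORECASE | re.DOTALL)


def looks_like_error_page(html):
    """Guess whether html is a "friendly" error page, from its title."""
    match = _TITLE_RE.search(html)
    if match is None:
        # No title to inspect, so there is no evidence either way.
        return False
    title = match.group(1).strip().lower()
    return any(hint in title for hint in ERROR_TITLE_HINTS)
```

The heuristic flags the slashdot page shown earlier, but it equally
flags a genuine document whose title is, say, "HTTP 404 explained",
which is exactly the kind of misfire described above.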

That's "convenience" and "friendliness" for you -- the new keywords
in whose name software designers now perpetrate the worst horrors,
although that old champ of this field, "performance", isn't too far
behind yet.


Alex





