Check URL --> Simply? (fwd)

Julius Welby jwelby at waitrose.com
Thu Aug 16 00:56:55 EDT 2001


That looks pretty comprehensive!

I thought I should repost a URL to my availability checker with e-mail alert
(alpha version). It is on-topic here: it checks whether each URL can be
opened at all, and then looks for a piece of arbitrary text that signifies a
failed attempt.

http://www.outwardlynormal.com/python/igor.txt

I've only tested it for one recipient of the e-mail, but it works fine for
me on multiple sites (you define a list of sites to check).

I'll rewrite it at some point, but I hope this is of some use.



"Dr. David Mertz" <mertz at gnosis.cx> wrote in message
news:mailman.997923024.24854.python-list at python.org...
> "Alex Martelli" <aleaxit at yahoo.com> with usual wisdom wrote:
> |So, slashdot doesn't give an error when I try to /GET that URL -- it
> |appears to give a perfectly valid page....
> |You may pepper your checking code with a zillion special cases
> |to try and identify the various "friendly error message pages"
> |returned by sites you're interested in -- and one day, of course,
> |your program will end up considering "not found" a perfectly
> |valid URL to a document with a title such as "404 File Not
> |Found" or something like that.
>
> Ever more opportunity for shameless self-promotion.  This zillion special
> cases of 404-ish pages is something I use as an example in my
> forthcoming book _Text Processing in Python_ (a few more months until
> done).  Here's the code I present as an attempt at recognizing what only
> humans can:
>
>     #---------- error_page.py ----------#
>     import re, sys
>     page = sys.stdin.read()
>
>     # Mapping from patterns to probability contribution of pattern
>     err_pats = {r'(?is)<TITLE>.*?(404|403).*?ERROR.*?</TITLE>': 0.95,
>                 r'(?is)<TITLE>.*?ERROR.*?(404|403).*?</TITLE>': 0.95,
>                 r'(?is)<TITLE>ERROR</TITLE>': 0.30,
>                 r'(?is)<TITLE>.*?ERROR.*?</TITLE>': 0.10,
>                 r'(?is)<META .*?(404|403).*?ERROR.*?>': 0.80,
>                 r'(?is)<META .*?ERROR.*?(404|403).*?>': 0.80,
>                 r'(?is)<TITLE>.*?File Not Found.*?</TITLE>': 0.80,
>                 r'(?is)<TITLE>.*?Not Found.*?</TITLE>': 0.40,
>                 r'(?is)<BODY.*(404|403).*</BODY>': 0.10,
>                 r'(?is)<H1>.*?(404|403).*?</H1>': 0.15,
>                 r'(?is)<BODY.*not found.*</BODY>': 0.10,
>                 r'(?is)<H1>.*?not found.*?</H1>': 0.15,
>                 r'(?is)<BODY.*the requested URL.*</BODY>': 0.10,
>                 r'(?is)<BODY.*the page you requested.*</BODY>': 0.10,
>                 r'(?is)<BODY.*page.{1,50}unavailable.*</BODY>': 0.10,
>                 r'(?is)<BODY.*request.{1,50}unavailable.*</BODY>': 0.10,
>                 r'(?i)does not exist': 0.10,
>                }
>     err_prob = 0
>     for pat, prob in err_pats.items():
>         if err_prob > 0.9: break
>         if re.search(pat, page):
>             # print pat, prob
>             err_prob += prob
>
>     if err_prob > 0.90:   print 'Page is almost surely an error report'
>     elif err_prob > 0.75: print 'It is highly likely page is an error report'
>     elif err_prob > 0.50: print 'Better-than-even odds page is error report'
>     elif err_prob > 0.25: print 'Fair indication page is an error report'
>     else:                 print 'Page is probably real content'
>
> You could play with this to include urllib.urlopen() instead of just
> reading STDIN, of course.  The threshold approach, IMO, does pretty
> well.  But nothing's perfect... in fact, I've found pages that I have a
> lot of trouble saying for sure whether they are real content or not
> using my own eyes.
>
> Yours, David...
>
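For anyone who wants to try David's urlopen suggestion, here is a rough
sketch: the scoring loop is just the one quoted above repackaged as a
function, with a trimmed pattern table, and the error_probability() and
classify() names are my own invention.

```python
import re

# Trimmed version of the pattern table quoted above; scores are David's.
ERR_PATS = {
    r'(?is)<TITLE>.*?(404|403).*?ERROR.*?</TITLE>': 0.95,
    r'(?is)<TITLE>.*?ERROR.*?(404|403).*?</TITLE>': 0.95,
    r'(?is)<TITLE>.*?File Not Found.*?</TITLE>': 0.80,
    r'(?is)<TITLE>.*?Not Found.*?</TITLE>': 0.40,
    r'(?i)does not exist': 0.10,
}

def error_probability(page):
    """Accumulate pattern scores; stop once the page is clearly an error."""
    prob = 0.0
    for pat, score in ERR_PATS.items():
        if prob > 0.9:
            break
        if re.search(pat, page):
            prob += score
    return prob

def classify(url):
    """Fetch a URL and score the returned page (network helper)."""
    import urllib.request          # plain urllib in 2001-era Python
    page = urllib.request.urlopen(url).read().decode('latin-1', 'replace')
    return error_probability(page)
```

Splitting the fetch from the scoring keeps the scoring testable on saved
pages, which is handy given how many odd "friendly error" variants exist.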





More information about the Python-list mailing list