Check URL --> Simply? (fwd)

Dr. David Mertz mertz at gnosis.cx
Wed Aug 15 20:34:07 EDT 2001


"Alex Martelli" <aleaxit at yahoo.com> with usual wisdom wrote:
|So, slashdot doesn't give an error when I try to /GET that URL -- it
|appears to give a perfectly valid page....
|You may pepper your checking code with a zillion special cases
|to try and identify the various "friendly error message pages"
|returned by sites you're interested in -- and one day, of course,
|your program will end up considering "not found" a perfectly
|valid URL to a document with a title such as "404 File Not
|Found" or something like that.

Ever more opportunity for shameless self-promotion.  This
zillion-special-cases problem of 404-ish pages is something I use as an
example in my forthcoming book _Text Processing in Python_ (a few more
months until it's done).  Here's the code I present as an attempt at
recognizing what only humans can:

    #---------- error_page.py ----------#
    import re, sys
    page = sys.stdin.read()

    # Mapping from patterns to probability contribution of pattern
    err_pats = {r'(?is)<TITLE>.*?(404|403).*?ERROR.*?</TITLE>': 0.95,
                r'(?is)<TITLE>.*?ERROR.*?(404|403).*?</TITLE>': 0.95,
                r'(?is)<TITLE>ERROR</TITLE>': 0.30,
                r'(?is)<TITLE>.*?ERROR.*?</TITLE>': 0.10,
                r'(?is)<META .*?(404|403).*?ERROR.*?>': 0.80,
                r'(?is)<META .*?ERROR.*?(404|403).*?>': 0.80,
                r'(?is)<TITLE>.*?File Not Found.*?</TITLE>': 0.80,
                r'(?is)<TITLE>.*?Not Found.*?</TITLE>': 0.40,
                r'(?is)<BODY.*(404|403).*</BODY>': 0.10,
                r'(?is)<H1>.*?(404|403).*?</H1>': 0.15,
                r'(?is)<BODY.*not found.*</BODY>': 0.10,
                r'(?is)<H1>.*?not found.*?</H1>': 0.15,
                r'(?is)<BODY.*the requested URL.*</BODY>': 0.10,
                r'(?is)<BODY.*the page you requested.*</BODY>': 0.10,
                r'(?is)<BODY.*page.{1,50}unavailable.*</BODY>': 0.10,
                r'(?is)<BODY.*request.{1,50}unavailable.*</BODY>': 0.10,
                r'(?i)does not exist': 0.10,
               }
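    # NB: a dict imposes no ordering, so the patterns get tried in
    # arbitrary order; the early break below merely skips needless
    # work once the "almost surely" verdict is already locked in.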
    err_prob = 0
    for pat, prob in err_pats.items():
        if err_prob > 0.9: break
        if re.search(pat, page):
            # print pat, prob
            err_prob += prob

    if err_prob > 0.90:   print 'Page is almost surely an error report'
    elif err_prob > 0.75: print 'It is highly likely page is an error report'
    elif err_prob > 0.50: print 'Better-than-even odds page is error report'
    elif err_prob > 0.25: print 'Fair indication page is an error report'
    else:                 print 'Page is probably real content'
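
To try it, just pipe a saved page into the script, something like the
below (the verdict shown is only one possible outcome, naturally):

    % python error_page.py < saved_page.html
    Page is almost surely an error report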

You could play with this to use urllib.urlopen() instead of just
reading STDIN, of course.  The threshold approach, IMO, does pretty
well.  But nothing's perfect... in fact, I've found pages where even my
own eyes have a lot of trouble saying for sure whether they are real
content or not.
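
In case anyone wants to try the urllib variant, here's a rough sketch
of one way to wire it up (just my quick sketch, not from the book; only
two of the patterns are repeated to keep it short, and the script name
and command-line handling are ad hoc choices):

    #---------- check_url.py ----------#
    import re, sys, urllib

    # Only two patterns from error_page.py repeated here, to keep
    # the sketch short; fill in the rest of the table likewise.
    err_pats = {r'(?is)<TITLE>.*?(404|403).*?ERROR.*?</TITLE>': 0.95,
                r'(?is)<TITLE>.*?Not Found.*?</TITLE>': 0.40,
               }

    def err_probability(page):
        # Same weighted-pattern scoring as in error_page.py
        prob = 0
        for pat, weight in err_pats.items():
            if re.search(pat, page):
                prob += weight
        return prob

    if __name__ == '__main__':
        url = sys.argv[1]   # URL to check, given on the command line
        page = urllib.urlopen(url).read()
        print 'Estimated error probability:', err_probability(page)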

Yours, David...



