Problem with popen() and a regular expression
Simon Willison
cs1spw at bath.ac.uk
Tue Mar 5 07:43:10 EST 2002
I've written a simple Python script to scan a bunch of URLs for "live"
sites and grab the title of those pages. It works by using popen() to
call lynx and analyse the HTTP response:
-----------------------------------------------------------------
command = "/opt/bin/lynx -mime_header http://www.bath.ac.uk/~"+user+"/"
f = os.popen(command)
l = f.readline() # Read first line of output, the HTTP status line
try:
    # Look for '200' HTTP status code indicating page exists
    i = l.index('200')
except ValueError:
    i = 0
if i:
    print "Web page exists!"
    # Now try and get the title of the page
    for line in f.readlines():
        line = string.strip(line)
        # look for <title>*</title> using regular expression
        result = re.search('\s*<title>([^<]*)</title>\s*', line)
        if result:
            # Found the title
            title = result.group(1)
            found[user] = title
            print "Page Title: " + title
            break
    if not found.has_key(user):
        found[user] = "[unknown]"
-----------------------------------------------------------------
You can see the output of the script here:
http://www.bath.ac.uk/~cs1spw/cs1sites.html
As you can see, the code works fine for most pages but fails to grab the
<title> tag on some of them (resulting in an [unknown] entry in the
list). I've checked the HTML for these pages and a <title></title> tag
exists in a form that should be picked up by my regular expression, but
for some reason it just doesn't work.
Any ideas?
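One possible explanation, offered only as a guess because the failing pages are not reproduced here: the pattern in the script is case-sensitive and is applied to one line at a time, so a page that writes <TITLE> in capitals, or that breaks its title across several lines, would never match. A minimal sketch of a more tolerant search follows; extract_title and TITLE_RE are hypothetical names, not part of the script above.

```python
import re

# Hypothetical helper, not part of the original script.
# IGNORECASE matches <TITLE> as well as <title>; DOTALL lets the title
# text span line breaks; [^>]* tolerates attributes on the opening tag.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>',
                      re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the page title with whitespace collapsed, or None if absent."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse newlines and runs of spaces inside the title into single spaces.
    return ' '.join(match.group(1).split())
```

Under that assumption, the script would read the rest of the lynx output in one go with f.read() and pass it to extract_title(), rather than looping over f.readlines(), falling back to "[unknown]" when the result is None.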