Problem with popen() and a regular expression

Simon Willison cs1spw at bath.ac.uk
Tue Mar 5 07:43:10 EST 2002


I've written a simple Python script to scan a bunch of URLs for "live" 
sites and grab the title of those pages. It works by using popen() to 
call lynx and analyse the HTTP response:

-----------------------------------------------------------------

import os
import re
import string

command = "/opt/bin/lynx -mime_header http://www.bath.ac.uk/~"+user+"/"
f = os.popen(command)
l = f.readline()  # Read first line of output, the HTTP status line
# Look for '200' HTTP status code indicating page exists
if l.find('200') != -1:
	print "Web page exists!"
	# Now try and get the title of the page
	for line in f.readlines():
		line = string.strip(line)
		# look for <title>*</title> using regular expression
		result = re.search(r'\s*<title>([^<]*)</title>\s*', line)
		if result:
			# Found the title
			title = result.group(1)
			found[user] = title
			print "Page Title: " + title
			break
	if not found.has_key(user):
		found[user] = "[unknown]"

-----------------------------------------------------------------

You can see the output of the script here:

http://www.bath.ac.uk/~cs1spw/cs1sites.html

As you can see, the code works fine for most pages but fails to grab the 
<title> tag on some of them (resulting in an [unknown] entry in the 
list). I've checked the HTML for these pages, and each one has a 
<title></title> tag in a form that I think should be picked up by my 
regular expression, but for some reason it just doesn't match.
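
Two possibilities I have not ruled out: the pattern is case-sensitive, so 
<TITLE> in capitals would be skipped, and it only ever sees one line at a 
time, so a title split across lines would be missed. Below is a minimal 
sketch for testing both guesses. The sample pages, SAMPLES, and the two 
helper functions are made up for illustration and are not taken from the 
real sites. The sketch also uses print() as a function, unlike the script 
above.

```python
import re

# Hypothetical sample pages (assumptions, not real data from the sites).
SAMPLES = {
    "lowercase": "<html><head><title>My Page</title></head></html>",
    "uppercase": "<HTML><HEAD><TITLE>My Page</TITLE></HEAD></HTML>",
    "multiline": "<html><head><title>\nMy Page\n</title></head></html>",
}

# Same pattern as the script: case-sensitive, applied line by line.
LINE_PATTERN = re.compile(r'\s*<title>([^<]*)</title>\s*')

# Candidate alternative: ignore case and search the whole document.
# [^<]* already matches newlines, so no DOTALL flag is needed.
DOC_PATTERN = re.compile(r'<title>([^<]*)</title>', re.IGNORECASE)


def title_by_line(html):
    """Mimic the script: search each stripped line separately."""
    for line in html.splitlines():
        match = LINE_PATTERN.search(line.strip())
        if match:
            return match.group(1)
    return None


def title_whole_doc(html):
    """Search the full text at once, ignoring tag case."""
    match = DOC_PATTERN.search(html)
    if match:
        return match.group(1).strip()
    return None


for name, html in SAMPLES.items():
    print(name, repr(title_by_line(html)), repr(title_whole_doc(html)))
```

If the failing pages behave like the "uppercase" or "multiline" samples, 
reading the whole response with f.read() instead of looping over 
f.readlines() might be enough, though I have not confirmed that this is 
what is going wrong.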

Any ideas?



