Problem with popen() and a regular expression

marduk marduk at python.net
Tue Mar 5 08:04:53 EST 2002


You're better off using htmllib (and httplib for that matter).  I haven't
looked at your examples, but I can tell just by looking at your code that
it will fail, for example, when "<title>" and "</title>" are on separate
lines in the HTML.  Also, I might be mistaken, but it doesn't look like
your re search is case-insensitive.
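
To make both failure modes concrete, here is a minimal sketch.  It is
written in present-day Python 3 syntax (an assumption; the quoted script is
Python 2), and the sample HTML strings and the names LINE_PATTERN,
DOC_PATTERN, title_by_line and title_by_document are made up for
illustration.  It only shows that a per-line, case-sensitive pattern misses
these titles, while one search over the whole document with re.IGNORECASE
and re.DOTALL finds them.

```python
import re

# The pattern from the quoted script, applied one line at a time.
LINE_PATTERN = re.compile(r'\s*<title>([^<]*)</title>\s*')

# Hypothetical alternative: search the whole document at once,
# ignoring case and letting the title span several lines.
DOC_PATTERN = re.compile(r'<title[^>]*>(.*?)</title>',
                         re.IGNORECASE | re.DOTALL)

# Made-up sample pages showing the two suspected problems.
split_html = "<html><head><title>\nMy Home Page\n</title></head></html>"
upper_html = "<HTML><HEAD><TITLE>Shouting Page</TITLE></HEAD></HTML>"


def title_by_line(html):
    """Mimic the quoted loop: strip each line, then search it on its own."""
    for line in html.splitlines():
        match = LINE_PATTERN.search(line.strip())
        if match:
            return match.group(1)
    return None


def title_by_document(html):
    """Search the full text so line breaks and tag case do not matter."""
    match = DOC_PATTERN.search(html)
    return match.group(1).strip() if match else None


for sample in (split_html, upper_html):
    print(title_by_line(sample), "|", title_by_document(sample))
```

A regular expression like this is still fragile against real-world HTML,
which is why a parser is the safer choice.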

Either way, try htmllib.  It takes care of all that stuff for you!
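
For anyone reading this on a newer interpreter: htmllib is a Python 2
module and was removed in Python 3.  The sketch below therefore uses
html.parser.HTMLParser as a stand-in, which is an assumption on my part and
not the module named above.  TitleParser and extract_title are illustrative
names, and the "[unknown]" default simply mirrors the quoted script.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []         # Text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html, default="[unknown]"):
    """Return the page title with whitespace collapsed, or the default."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or default


page = "<HTML><HEAD><TITLE>\n  My Home\n  Page\n</TITLE></HEAD></HTML>"
print(extract_title(page))
```

Because the parser sees the whole document, it does not matter whether the
tags are upper case or spread over several lines.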

--m

On Tue, 05 Mar 2002 06:43:10 -0600, Simon Willison wrote:

> I've written a simple Python script to scan a bunch of URLs for "live"
> sites and grab the title of those pages. It works by using popen() to
> call lynx and analyse the HTTP response:
> 
> -----------------------------------------------------------------
> 
> command = "/opt/bin/lynx -mime_header http://www.bath.ac.uk/~"+user+"/"
> f = os.popen(command)
> l = f.readline()  # Read first line of output, the HTTP status line
> try:
> 	# Look for '200' HTTP status code indicating page exists
> 	i = l.index('200')
> except ValueError:
> 	i = 0
> if i:
> 	print "Web page exists!"
> 	# Now try and get the title of the page
> 	for line in f.readlines():
> 		line = string.strip(line)
> 		# look for <title>*</title> using regular expression
> 		result = re.search('\s*<title>([^<]*)</title>\s*', line)
> 		if result:
> 			# Found the title
> 			title = result.group(1)
> 			found[user] = title
> 			print "Page Title: " + title
> 			break
> 	if not found.has_key(user):
> 		found[user] = "[unknown]"
> 
> -----------------------------------------------------------------
> 
> You can see the output of the script here:
> 
> http://www.bath.ac.uk/~cs1spw/cs1sites.html
> 
> As you can see, the code works fine for most pages but fails to grab the
> <title> tag on some of them (resulting in an [unknown] entry in the
> list). I've checked the HTML for these pages and a <title></title> tag
> exists in a form that should be picked up by my regular expression, but
> for some reason it just doesn't work.
> 
> Any ideas?
