Problem with popen() and a regular expression
marduk
marduk at python.net
Tue Mar 5 08:04:53 EST 2002
You're better off using htmllib (and httplib, for that matter). I haven't
looked at your examples, but I can tell just by looking at your code that
it will fail, for example, when "<title>" and "</title>" are on separate
lines in the HTML. Also, I might be mistaken, but it doesn't look like
your re search is case-insensitive.
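
To illustrate those two failure modes, here is a minimal sketch in modern
Python 3 syntax (the quoted script is Python 2). The sample pages and the
function names are hypothetical, made up for the illustration, and the
"tolerant" pattern is only one possible variant, not a robust HTML parser.

```python
import re

# The pattern from the quoted script, applied one line at a time.
LINE_PATTERN = re.compile(r'\s*<title>([^<]*)</title>\s*')

# A more tolerant variant (an assumption, not the original author's code):
# case-insensitive, and run against the whole document so that the title
# may span several lines.
DOC_PATTERN = re.compile(r'<title[^>]*>(.*?)</title>',
                         re.IGNORECASE | re.DOTALL)


def title_by_line(html):
    """Mimic the original loop: search each line separately."""
    for line in html.splitlines():
        match = LINE_PATTERN.search(line.strip())
        if match:
            return match.group(1)
    return None


def title_whole_doc(html):
    """Search the whole document and collapse internal whitespace."""
    match = DOC_PATTERN.search(html)
    if match is None:
        return None
    return " ".join(match.group(1).split())


# Hypothetical sample pages, not taken from the sites in this thread.
split_title = "<html><head><title>\nMy Home Page\n</title></head></html>"
upper_title = "<HTML><HEAD><TITLE>My Home Page</TITLE></HEAD></HTML>"

for page in (split_title, upper_title):
    print(title_by_line(page), "|", title_whole_doc(page))
```

The line-by-line search finds nothing in either sample, while the
whole-document search recovers the title from both.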
Either way, try htmllib. It takes care of all that stuff for you!
--m
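
A rough sketch of the parser-based approach suggested above. Note that
htmllib is a Python 2 module and was removed in Python 3, so this sketch
swaps in html.parser from the Python 3 standard library instead. The
TitleParser class and the extract_title helper are hypothetical names
chosen for the example. Fetching the page is left out; the function only
takes the HTML text.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks; keep them all.
        if self.in_title:
            self.parts.append(data)

    @property
    def title(self):
        # Join the chunks and collapse newlines and runs of spaces.
        text = " ".join("".join(self.parts).split())
        return text or None


def extract_title(html):
    """Return the page title, or None if no non-empty title is found."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


# Hypothetical sample page with an upper-case, multi-line title.
sample = "<HTML><HEAD><TITLE>\nMy Home\nPage</TITLE></HEAD></HTML>"
print(extract_title(sample))
```

Because the parser works on the tag stream rather than on individual
lines, tag case and line breaks inside the title stop mattering.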
On Tue, 05 Mar 2002 06:43:10 -0600, Simon Willison wrote:
> I've written a simple Python script to scan a bunch of URLs for "live"
> sites and grab the title of those pages. It works by using popen() to
> call lynx and analyse the HTTP response:
>
> -----------------------------------------------------------------
>
> command = "/opt/bin/lynx -mime_header http://www.bath.ac.uk/~"+user+"/"
> f = os.popen(command)
> l = f.readline() # Read first line of output, the HTTP status line
> try:
>     # Look for '200' HTTP status code indicating page exists
>     i = l.index('200')
> except ValueError:
>     i = 0
> if i:
>     print "Web page exists!"
>     # Now try and get the title of the page
>     for line in f.readlines():
>         line = string.strip(line)
>         # look for <title>*</title> using regular expression
>         result = re.search('\s*<title>([^<]*)</title>\s*', line)
>         if result:
>             # Found the title
>             title = result.group(1)
>             found[user] = title
>             print "Page Title: " + title
>             break
>     if not found.has_key(user):
>         found[user] = "[unknown]"
>
> -----------------------------------------------------------------
>
> You can see the output of the script here:
>
> http://www.bath.ac.uk/~cs1spw/cs1sites.html
>
> As you can see, the code works fine for most pages but fails to grab the
> <title> tag on some of them (resulting in an [unknown] entry in the
> list). I've checked the HTML for these pages and a <title></title> tag
> exists in a form that should be picked up by my regular expression, but
> for some reason it just doesn't work.
>
> Any ideas?