[Tutor] html programming ???

Jeff Shannon jeff@ccvcorp.com
Tue, 02 Oct 2001 09:50:23 -0700


> hi all,
> i am trying to write python script which takes a url and opens that page
> ...then i am trying to grab all  the links present in that page and want
> to put that links in a data structure.....
> now my problem is that how do i use a data structure and how should i put
> this links into that.....

Hi Samir

A few minor pointers.  First, the re module is probably more powerful than what you need.  It will certainly work, but you could do much the
same thing with string.find(), which is much simpler; a quick sketch of that approach follows.  However, I'll keep the re in place in my
example below, so that this change doesn't obscure the other changes I'm making.  Also, your 'while not done' construct seems a bit
awkward--I'll show you a more standard idiom below.  I'm also putting everything into functions--that makes for cleaner, more understandable
code, and makes it easier to expand on this later.  Finally, on to your real question--how to store your links.  What you need to do is
create a list, and then append any links you've found to that list.
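
Just for illustration, here's roughly what a string.find()-based version of the extraction might look like.  This is an untested sketch, and
it assumes every link is written as a lowercase href="..." with double-quoted attribute values:

import string

def findlinks_simple(text):
    # Untested sketch -- assumes links appear as lowercase href="..."
    # with double-quoted attribute values.
    links = []
    pos = 0
    while 1:
        start = string.find(text, 'href="', pos)
        if start == -1:                       # no more links in the text
            break
        start = start + 6                     # skip past the 'href="' itself
        end = string.find(text, '"', start)   # find the closing quote
        if end == -1:                         # unterminated attribute -- give up
            break
        links.append(text[start:end])
        pos = end + 1
    return links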

So, here's my quick (untested) rewrite of your script:

import sys
import re
import urllib

def findlinks(url):
    try:
        fp = urllib.urlopen(url)
    except IOError:
        return []  # return an empty list if we can't get a page

    results = []  # create an empty list to store results in

    while 1:   # loop until something else stops us
        line = fp.readline()
        if line == "":
            break    # at end-of-file, break out of the while loop
        links = re.findall('href="(.*?)"', line)  # I prefer using plural identifiers for lists...
        results = results + links   # adding two lists creates a list of all items in both lists

    fp.close()
    return results



input_url = sys.argv[1]  # might be good to add some error checking here (see below)...
found = findlinks(input_url)

print "Links found in %s", input_url
for item in found:
    print item
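
As that comment hints, a couple of extra lines will give a friendlier message than the IndexError you'd otherwise get when no URL is passed
on the command line.  Something along these lines (again, untested):

if len(sys.argv) < 2:
    print "Usage: %s <url>" % sys.argv[0]   # sys.argv[0] is the script's own name
    sys.exit(1)

input_url = sys.argv[1]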


Two things are worth further comment.  First, in your initial try/except block, you assigned an error-message string to fp when urlopen()
failed.  However, calling fp.readline() or fp.close() on that string would just raise another exception, since strings have no such
methods.  It seems to me that returning an empty list is the simplest error handling; you may want to do something else, depending on your
intent.
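
For instance, if you'd rather report the failure than fail silently, you could write a warning before returning.  A minimal variation on
the top of findlinks() (this assumes sys has been imported, as in the script above):

    try:
        fp = urllib.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't open %s\n" % url)  # report the problem...
        return []                                     # ...then fall back to an empty result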

The bigger issue, though, is that you're going through the file (url) line by line, when there's no need to at all.  In fact, an href that
happens to be split across two lines would slip right past a line-by-line scan.  re.findall() can handle strings of fairly considerable
length, so you can replace that central while loop with this:

    text = fp.read()
    results = re.findall('href="(.*?)"', text)
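
Putting those pieces together, the whole function shrinks down to something like this (still untested):

def findlinks(url):
    try:
        fp = urllib.urlopen(url)
    except IOError:
        return []            # couldn't get the page

    text = fp.read()         # slurp the whole page at once
    fp.close()
    return re.findall('href="(.*?)"', text)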

Hope this helps!


Jeff Shannon
Technician/Programmer
Credit International