Fwd: [Tutor] html programming ???
Wed, 3 Oct 2001 17:04:28 +1000 (EST)
Hi Jeff Shannon,
Thanks for your help. I implemented the things you told me to do,
and I have also tried adding the links I find to a Queue.
My next problem is that I have to call this function recursively,
because I want to find links inside the links I have already found,
down to a depth of 3. So how does recursion work in Python?
Anyway, thanks for your help.
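On the recursion question: a recursive function in Python is simply a function that calls itself, with a condition that eventually stops the calls. The following is only a minimal sketch of a depth-limited link crawl, not a tested crawler. It is written for current Python 3 (urllib.request instead of the urllib.urlopen used in this thread), and the names crawl, fetch, fetch_page, seen and HREF_RE are made up for illustration. It assumes links are written as href="..." with double quotes, and it collects results in a plain list rather than a Queue.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed link pattern: href="..." with double quotes only.
HREF_RE = re.compile(r'href="(.*?)"', re.IGNORECASE)


def fetch(url):
    """Return the page text, or "" if the page cannot be read."""
    try:
        with urlopen(url, timeout=10) as fp:
            return fp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError is a subclass of OSError; ValueError covers malformed URLs.
        return ""


def crawl(url, depth=3, seen=None, fetch_page=fetch):
    """Collect links found on url and on linked pages, `depth` levels deep."""
    if seen is None:
        seen = set()
    # Base case: stop when the depth budget is used up or the page was visited.
    if depth == 0 or url in seen:
        return []
    seen.add(url)
    found = []
    for href in HREF_RE.findall(fetch_page(url)):
        link = urljoin(url, href)  # turn relative links into absolute ones
        found.append(link)
        # Recursive case: the same function, one level closer to the base case.
        found.extend(crawl(link, depth - 1, seen, fetch_page))
    return found
```

With the default depth of 3, crawl(url) reads the start page, the pages it links to, and the pages those link to. The seen set keeps the function from fetching the same URL twice, which also prevents endless loops between pages that link to each other.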
On Wed, 3 Oct 2001, Samir Patel wrote:
> >From: "Jeff Shannon"
> >To: email@example.com
> >Subject: [Tutor] html programming ???
> >Date: Tue, 02 Oct 2001 09:50:23 -0700
> > > Hi all,
> > > I am trying to write a Python script that takes a URL and opens that page.
> > > Then I want to grab all the links present in that page and put those
> > > links in a data structure.
> > > My problem is: which data structure should I use, and how should I put
> > > the links into it?
> >Hi Samir
> >A few minor pointers. First, the re module is probably more powerful than what you need. It will certainly work, but you could do much the same thing with string.find(), which is much simpler. However, I'll keep the re in place in my example below, so that this change doesn't obscure the other changes I'm making.
> >Also, your 'while not done' construct seems a bit awkward--I'll show you a more standard idiom below. I'm also putting everything into functions--it makes for cleaner, more understandable code, and makes it easier to expand on this later.
> >Finally, on to your real question--how to store your links. What you need to do is create a list, and then add any links that you've found to that list.
> >So, here's my quick (untested) rewrite of your script:
> >import re
> >import sys
> >import urllib
> >
> >def findlinks(url):
> >    try:
> >        fp = urllib.urlopen(url)
> >    except IOError:
> >        return []  # return an empty list if we can't get a page
> >    results = []  # create an empty list to store results in
> >    while 1:  # loop until something else stops us
> >        line = fp.readline()
> >        if line == "":
> >            break  # if we're at end-of-file, then break out of while loop
> >        links = re.findall('href="(.*?)"', line)  # I prefer using plural identifiers for lists...
> >        results = results + links  # adding two lists creates a list of all items in both lists
> >    fp.close()
> >    return results
> >
> >input_url = sys.argv[1]  # might be good to add some error checking here...
> >found = findlinks(input_url)
> >print "Links found in %s" % input_url
> >for item in found:
> >    print item
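For comparison, the string.find() approach mentioned earlier in this reply might look something like the sketch below. It is written for current Python 3, where find is a method on the string itself, and the function name find_links is made up for illustration. Like the regular expression above, it assumes every link is written as href="..." with double quotes.

```python
def find_links(text):
    """Return every href="..." value in text, using str.find instead of re."""
    marker = 'href="'
    links = []
    pos = 0
    while True:
        start = text.find(marker, pos)  # -1 means no more links
        if start == -1:
            break
        start += len(marker)            # skip past the marker itself
        end = text.find('"', start)     # closing quote of the attribute
        if end == -1:
            break                       # unterminated attribute: stop here
        links.append(text[start:end])
        pos = end + 1                   # continue searching after this link
    return links
```

The loop keeps a position and searches forward from it each time, which is the bookkeeping that re.findall() otherwise does for you.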
> >Two things worth further comment--in your initial try/except block, you assigned an error-message string to fp if the urlopen() failed. However, when you do fp.readline() or fp.close() on that string, you'll throw another exception. It seems to me that returning an empty list is the simplest error-handling. You may want to do something else, depending on your intent.
> >The bigger issue, though, is that you're going through the file (url) line-by-line, when there's no need to at all. The re.findall() should be able to handle strings of fairly considerable length. So you can replace that central while loop with this:
> >    text = fp.read()
> >    results = re.findall('href="(.*?)"', text)
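To see that the single findall() call gives the same result as the line-by-line loop, here is a small sketch that runs both over an in-memory string instead of a fetched page. The sample text and the variable names page, per_line and whole are invented for the example.

```python
import re

# Invented sample page: three links spread over several lines.
page = (
    '<a href="one.html">1</a>\n'
    '<p>no link on this line</p>\n'
    '<a href="two.html">2</a> <a href="three.html">3</a>\n'
)

# Line by line, as in the while loop.
per_line = []
for line in page.splitlines():
    per_line = per_line + re.findall(r'href="(.*?)"', line)

# Whole text in one call.
whole = re.findall(r'href="(.*?)"', page)

print(per_line == whole)
```

Both approaches find the links in the same order. The one-call version just skips the loop and the end-of-file check.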
> >Hope that this helps!
> >Jeff Shannon
> >Credit International
> >Tutor maillist - Tutor@python.org