mangled attempt at using htmllib

Steve Holden sholden at holdenweb.com
Wed Oct 11 15:33:07 EDT 2000


Ari Davidow wrote:
> 
> Well,
> 
> I think I'm very lost again. The idea was to parse a file which
> contains lots of lines in the following format:
> 
> 200 OK         <a href="urlstatusgo.html?col=test&url=http%
> 3A//www.foobar.com/archive/091400.html">http://www.foobar.com/archive/09
> 1400.html</a>
> 
Well, I think your first misapprehension is that you appear to be
expecting HTTP back from the urllib readlines() call, when in fact
the HTTP headers are stripped off, and what *you* see is just the HTML!

> What I thought I was doing was looking for lines that began with
> the "200 OK        '
> string, then using htmllib to return the text between the anchor tags.

Good strategy, but htmllib has a couple of quirks, the most notable of which
is that anchors are NOT handled by start_a and end_a, but by anchor_bgn
and anchor_end.

> If I run the program, below, I get the following error:
> 
> D:\Program Files\Python\work>python filtertest4.py
> 
> Traceback (innermost last):
>   File "filtertest4.py", line 58, in ?
>     print '%s' % process(line)
>   File "filtertest4.py", line 36, in process
>     text2print = parser.getLink
> AttributeError: getLink
> 
> So, two questions. First, how do I do this sanely.

I've hacked your program somewhat to try and do what I think you were
trying to do!  Note that the default action of the htmllib parser is
to build up a list of anchors' HREF attributes in something called
anchorlist, so the default action may actually be all you need if you
are trying to crawl a web site!
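On later Pythons, where htmllib has been retired, you can mimic its
anchorlist behaviour with the standard html.parser module in a few
lines. A sketch, assuming Python 3's standard library (HrefLister is
my name for it, not anything built in):

```python
from html.parser import HTMLParser

class HrefLister(HTMLParser):
    """Mimic old htmllib's anchorlist: accumulate every HREF seen."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.anchorlist.append(href)

p = HrefLister()
p.feed('<a href="a.html">A</a> <a href="b.html">B</a>')
print(p.anchorlist)   # -> ['a.html', 'b.html']
```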

My code appears below.  I have annotated yours to highlight a few of
the misapprehensions you need to clear up.

> Second, when I get
> an AttributeError, is there a way to find out what Attributes were
> being expected, so I can adjust whatever I'm doing accordingly?
> 
Well, Python is doing its best!  As a response to your statement

     text2print = parser.getLink

Python says:

AttributeError: getLink

which is a pretty clear indication you have tried to access a non-
existent attribute called (guess what) getLink!
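As for finding out what attributes an object really has: ask it.
dir() lists them, and hasattr() probes one by name. A small
illustration (Demo is a made-up class, not part of your program):

```python
# Probe an object for its attributes instead of guessing.
class Demo:
    def __init__(self):
        self.current_data = ''
    def getLink(self):
        return self.current_data

d = Demo()
print(hasattr(d, 'getLink'))    # True - defined at class scope, so visible
# List the public attributes, instance and class alike:
print([n for n in dir(d) if not n.startswith('_')])
# -> ['current_data', 'getLink']
```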

> ---------------------------------------------------------------
> import re, htmllib,formatter,string
> import exceptions,string,sys,traceback,time
> import DateTime
> import ODBC,ODBC.Windows,ODBC.Misc.proc
> import urllib
> 
> class seekUrl(htmllib.HTMLParser):
> 
>     def __init__(self):
>         self.start=0
>         self.current_data=''
>         htmllib.HTMLParser.__init__(
>             self, formatter.NullFormatter())
> 
###
### MAJOR PROBLEM: The functions below are all indented, which means that they
### get redefined every time you create a seekUrl, but they are not visible
### as attributes of the object, since their scope is the __init__ function.
###
>         def start_a(self,attributes):
>                 self.current_data='boo!'
>         def end_a(self):
>                 pass
>         def handle_data(self, data):
>                 self.current_data=data
>         def getLink(self):
>                 return self.current_data
> 
### And, of course, that's why you got the attribute error!
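You can see the scope problem in isolation: a def nested inside
__init__ creates a local function, not a method, so the instance never
gains the attribute. For example (Broken and Fixed are hypothetical
class names, plain Python):

```python
class Broken:
    def __init__(self):
        # Looks like a method, but it is just a local name inside
        # __init__ and vanishes when __init__ returns.
        def getLink(self):
            return 'never reachable'

class Fixed:
    def getLink(self):          # defined at class scope: a real method
        return 'works'

print(hasattr(Broken(), 'getLink'))   # False
print(Fixed().getLink())              # works
```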

> def process(stuff):
>     parser=seekUrl()
>     parser.feed(stuff)
>     text2print = parser.getLink
> 
### The basic logic above is good, but the parser is really built to
### gobble whole files, not bits and pieces, and if you miss chunks
### out of the HTML you have, it will get confused.

> def showparse(filename):
>     pprint.pprint(process(stuff))
> 
> currentFilePrefix = 'http://'
> fileDict = {'www.cio.com':'webbusiness.cio.com'}
> testOutFile = 'd:\\program files\\python\\work\\ciodump.html'
### I presume the next line is left over from some testing?
> parser=seekUrl()
###
### I think it's safer to create a new parser each time you parse a file.
### I don't think there's any guarantee of reusability, and of course if
### there are syntax errors in one file they'll affect the next one!
###
> 
> for root_url,currentFile in fileDict.items():
>         currentFile = currentFilePrefix+currentFile+'/'
>         current = urllib.urlopen(currentFile)
>         inFile = current.readlines()
> 
### This gives you a list of lines.  Much cleaner to use read(), which
### returns the whole shebang to be fed into the parser.
>         for line in inFile:
>                 if re.search('200 OK         ',line):
>                         print '%s' % process(line)
### This seems to assume that each line is self-standing, correctly-
### formatted HTML.  Most pages won't work like that, so you'll lose.
> #                       foo = parser.myUrl()
> #                       print '%s' % foo
> current.close()
### Should be indented to execute each time around the "for" loop.
> # don't forget to clean up after yourself.
> urllib.urlcleanup()
> ------------------------------------------
> 
> Many, many thanks,
> ari
> 
> --
> Ari Davidow
> ari at ivritype.com
> 
OK, here's some code that will show you each piece of text, followed
by the URL it links to.  Note that it might be simpler just to use a
simple HTMLParser, and then access its anchorlist attribute after you've
stuffed the data through the parser.

I've left your unused imports in, assuming you will use them later.

Hope this helps!

regards
 Steve

--

import re, htmllib, formatter, string
import exceptions, sys, traceback, time, pprint
import DateTime
import ODBC,ODBC.Windows,ODBC.Misc.proc
import urllib

class seekUrl(htmllib.HTMLParser):

    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        self.start = 0
        self.alist = []
        self.c_data = ""

    def anchor_bgn(self, href, name, type):
        print "anchor_bgn href:", href, "\nname:", name, "\ntype:", type
        self.href = href
        self.c_data = ""

    def anchor_end(self):
        self.alist.append([self.href, self.c_data])

    def handle_data(self, data):
        self.c_data = self.c_data + data

    #def getLink(self):
    #    return self.c_data

    def close(self):
        # let the parser flush any pending data before we report
        htmllib.HTMLParser.close(self)
        return self.alist

def process(stuff):
    parser = seekUrl()
    parser.feed(stuff)
    return parser.close()

def showparse(filename):
    pprint.pprint(process(open(filename).read()))


currentFilePrefix = 'http://'
fileDict = {'My Web Site':'www.holdenweb.com'}
testOutFile = 'd:\\temp\\ciodump.html'

for root_url,currentFile in fileDict.items():
    parser=seekUrl()
    currentFile = currentFilePrefix+currentFile+'/'
    current = urllib.urlopen(currentFile)
    inFile = current.read()
    parser.feed(inFile)
    current.close()
    l = parser.close()
    for anchor in l:
        print anchor[1], "links to:", anchor[0]
    
# don't forget to clean up after yourself.
urllib.urlcleanup()
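And for anyone on a newer Python, where htmllib is gone and urllib has
been split up: a rough port of the same logic, a sketch rather than a
drop-in replacement. html.parser's HTMLParser stands in for htmllib's,
and urllib.request.urlopen for urllib.urlopen; links_in is my helper
name, not a library function.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class SeekUrl(HTMLParser):
    """Collect [href, text] pairs for every anchor in the document."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.alist = []
        self._href = None      # href of the anchor we are inside, if any
        self._data = []        # text fragments collected inside it

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._data = []

    def handle_data(self, data):
        if self._href is not None:
            self._data.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.alist.append([self._href, "".join(self._data)])
            self._href = None

def links_in(html):
    parser = SeekUrl()
    parser.feed(html)
    parser.close()
    return parser.alist

if __name__ == "__main__":
    file_dict = {'My Web Site': 'www.holdenweb.com'}
    for root_url, host in file_dict.items():
        with urlopen('http://' + host + '/') as page:
            text = page.read().decode('utf-8', 'replace')
        for href, anchor_text in links_in(text):
            print(anchor_text, "links to:", href)
```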
-- 
Helping people meet their information needs with training and technology.
703 967 0887      sholden at bellatlantic.net      http://www.holdenweb.com/

More information about the Python-list mailing list