how to get text between HTML tags with URLLIB??

Paolo G. Cantore paolo.cantore at freenet.de
Sun Aug 20 11:30:44 EDT 2000


Roy Katz wrote:

> 
>     def start_a( self, attrs ):
>         self.in_link = 1
>         self.linkbuf = link_type( attrs )
> 
>     def end_a( self ):
>         self.in_link = 0
>         self.linkbuf.name = self.strbuf
>         self.links.append( self.linkbuf )
>         self.strbuf = ''
> 
>     def handle_data( self, data ):
>         if self.in_link == 1:
>             self.strbuf = self.strbuf + data
> 
> This approach works fairly well;  furthermore, the 'in_link' flag
> ensures that strbuf will contain *only* the text between the <a href> and
> </a> tags.  There is a problem with this approach, however.  I meant for
> link_type.links to be a list of strings corresponding to the placement
> of the link within the bookmark hierarchy; however, given Netscape's
> bookmark format, I see that it will take me a lot more code than I
> thought.  I *am* building an in-memory model.  So why re-invent the
> wheel? You're right, I'll look at DOM, I just need a few examples of how
> to use it effectively.
> 
> Roey

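Your in_link processing is already provided by two parser methods,
save_bgn() and save_end() (assuming your class derives from
htmllib.HTMLParser). save_bgn() starts buffering character data, and
save_end() stops buffering and returns the collected text, so you can
drop the in_link flag, strbuf and your own handle_data(). Your code
would look like: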

def start_a(self, attrs):
    self.save_bgn()                       # start buffering character data
    self.linkbuf = link_type(attrs)

def end_a(self):
    self.linkbuf.name = self.save_end()   # text collected since save_bgn()
    self.links.append(self.linkbuf)

That's all.
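
For anyone trying this on Python 3, where htmllib no longer exists, the
same buffering idea can be sketched on top of html.parser. This is only
a rough illustration, not the htmllib implementation: the Link class
stands in for link_type, and LinkTextParser, extract_links and the
hand-written save_bgn()/save_end() are names made up for the example.

```python
from html.parser import HTMLParser


class Link:
    """Hypothetical stand-in for the link_type class in the thread."""

    def __init__(self, attrs):
        self.attrs = dict(attrs)   # e.g. {'href': 'http://...'}
        self.name = ''             # filled in when </a> is seen


class LinkTextParser(HTMLParser):
    """Collects the text between <a ...> and </a> for every link."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.linkbuf = None
        self.savedata = None       # None means "not currently saving"

    def save_bgn(self):
        # Start collecting character data (assumed analogue of htmllib's).
        self.savedata = ''

    def save_end(self):
        # Stop collecting and return the text, collapsing whitespace runs.
        data, self.savedata = self.savedata, None
        return ' '.join((data or '').split())

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.save_bgn()
            self.linkbuf = Link(attrs)

    def handle_endtag(self, tag):
        # Ignore a stray </a> that has no matching <a>.
        if tag == 'a' and self.linkbuf is not None:
            self.linkbuf.name = self.save_end()
            self.links.append(self.linkbuf)
            self.linkbuf = None

    def handle_data(self, data):
        # Only keep text while a save is in progress.
        if self.savedata is not None:
            self.savedata += data


def extract_links(html):
    """Return (href, link text) pairs found in an HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return [(link.attrs.get('href'), link.name) for link in parser.links]
```

Calling extract_links() on a page fetched with urllib should give one
(href, text) pair per link; markup nested inside the link, such as <b>,
is dropped and only its text is kept.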
--


