[Tutor] Help with Parsing HTML files [html/OOP]

Charlie Clark <charlie@begeistert.org>
Tue, 07 Aug 2001 17:45:03 +0200


>class EmphasisGlancer(htmllib.HTMLParser):
>    def __init__(self):
>        htmllib.HTMLParser.__init__(self,
>                                    formatter.NullFormatter())
>        self.in_bold = 0
>        self.in_underline = 0
>
>    def start_b(self, attrs):
>        print "Hey, I see a bold tag!"
>        self.in_bold = 1
>
>    def end_b(self):
>        self.in_bold = 0
>
>    def start_u(self, attrs):
>        print "Hey, I see some underscored text!"
>        self.in_underline = 1
>
>    def end_u(self):
>        self.in_underline = 0
>
>
>    def start_blink(self, attrs):
>        print "Hey, this is some heinously blinking test... *grrrr*"
>
>
>    def handle_data(self, data):
>        if self.in_bold:
>             print "BOLD:", data
>        elif self.in_underline:
>             print "UNDERLINE:", data
>###
>
>
>I have not tested this code yet, but hopefully I haven't made too many
>typos.  If you play around with it, you might find that sgmllib/htmllib
>isn't as bad as you think.

Getting there but my head is really hurting. I've been to the bookshop and 
picked up Fredrik Lundh's book and tried to make sense of that :-(

I still don't understand when to use sgmllib and when to use htmllib. I don't 
know whether they are good or bad at the moment.

I made my own sample parser based on this little snippet. As far as I can 
tell, a start_<tag> method can be set to do something when the tag opens, I 
guess a do_<tag> method can apply some logic to a tag, such as checking 
whether its attributes are okay, and handle_data does things with the data 
enclosed by the tag.
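
To make the naming convention concrete, here is a minimal sketch of how such a 
dispatch could work. It is written against Python 3's html.parser.HTMLParser, 
because htmllib and sgmllib are not part of the Python 3 standard library. The 
class name MethodDispatchParser and the events list are made up for 
illustration, and this is not sgmllib's actual implementation. As the sgmllib 
documentation describes it, start_<tag> and end_<tag> handle tags that enclose 
content, while do_<tag> handles tags that have no closing tag, such as <br>.

```python
from html.parser import HTMLParser


class MethodDispatchParser(HTMLParser):
    """Route tags to start_<tag>, do_<tag> and end_<tag> methods, if defined."""

    def __init__(self):
        super().__init__()
        self.events = []  # hypothetical log so the result can be inspected

    def handle_starttag(self, tag, attrs):
        # Prefer start_<tag>; fall back to do_<tag> for tags with no closing tag.
        handler = getattr(self, "start_" + tag, None)
        if handler is None:
            handler = getattr(self, "do_" + tag, None)
        if handler is not None:
            handler(attrs)

    def handle_endtag(self, tag):
        handler = getattr(self, "end_" + tag, None)
        if handler is not None:
            handler()

    def start_b(self, attrs):
        self.events.append("start b")

    def end_b(self):
        self.events.append("end b")

    def do_br(self, attrs):
        self.events.append("br")

    def handle_data(self, data):
        self.events.append("data: " + data.strip())


parser = MethodDispatchParser()
parser.feed("<b>Runde</b> Rennen<br>")
parser.close()
print(parser.events)
```

Tags without a matching method are simply skipped, so only the tags of 
interest need a handler.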

How do I deal with nested structures where the data I'm interested in goes 
across several tags, some of which aren't properly closed in the source!?

Here's some sample source:
<font face="Arial" size = "2"><b>Runde</b> Rennen &#150;<b>Zeit</b> 
13:59:49:(MEZ)<b>Wetter</b>sonnig<br>Herzlich willkommen
<br><br><table border="0" cellspacing="0"><tr><td><img src="/images/
trans.gif"></td></tr></table>
<br>
....

I thought I might be able to use the flag-setting method to indicate the 
start of the article, which in this case is the font tag.

import formatter
import htmllib
import string

class TagFinder(htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        # why is this necessary? what does it do?
        self.text = 0
        self.article = []        # collect an individual article
        self.articles = []       # list of all articles

    def start_font(self, attrs):
        self.text = 1

    # the font tag isn't closed so we'll reset the flag when the table starts
    
    def start_table(self, args):
        self.text = 0

    def handle_data(self, data):
        if self.text:
            self.article.append(string.lstrip(data)) # spaces cause problems
        elif self.article != []:
            # add to the list collection
            self.articles.append(" ".join(self.article))
            self.article = []
        
# read in from a file, cleaned up and searched to the beginning
# ("source.html" is only a placeholder name)
content = open("source.html").read()
   
c = TagFinder()
c.feed(content)
c.close()
open("out.txt", "w").write("\n-----\n".join(c.articles))  # nearly obfuscated
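
For anyone trying this on a current Python, the same flag idea can be sketched 
with html.parser.HTMLParser, since htmllib is not part of the Python 3 
standard library. This is a sketch under assumptions rather than a drop-in 
replacement: the class name ArticleCollector and the shortened sample string 
are made up for illustration. Unlike the code above, it stores the pending 
article as soon as the table starts and again in close(), so the last article 
is not lost if no more text follows.

```python
from html.parser import HTMLParser


class ArticleCollector(HTMLParser):
    """Collect text that starts at a <font> tag and stops at the next <table>."""

    def __init__(self):
        super().__init__()
        self.in_text = False   # True while we are inside an article
        self.article = []      # pieces of the current article
        self.articles = []     # all finished articles

    def _flush(self):
        # Move the current article, if any, into the finished list.
        if self.article:
            self.articles.append(" ".join(self.article))
            self.article = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self.in_text = True
        elif tag == "table":
            # The <font> tag is never closed, so the table marks the end.
            self.in_text = False
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if self.in_text and text:
            self.article.append(text)

    def close(self):
        super().close()
        self._flush()


# Shortened, made-up variant of the sample source shown earlier.
sample = (
    '<font face="Arial" size="2"><b>Runde</b> Rennen <b>Zeit</b> 13:59:49:(MEZ)'
    '<b>Wetter</b>sonnig<br>Herzlich willkommen<br><br>'
    '<table border="0"><tr><td>ignored</td></tr></table>'
)
collector = ArticleCollector()
collector.feed(sample)
collector.close()
print(collector.articles)
```

Because the unclosed font tag never produces an end-tag event, the flag has to 
be reset by some other event, here the start of the table.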

It seems to work, but it also runs slower than my previous version, not that 
it really matters in this case.

I would very much appreciate any comments on this as I am sure it could be 
improved!

Charlie