[Tutor] Help with Parsing HTML files [html/OOP]

Charlie Clark <charlie@begeistert.org>
Fri, 10 Aug 2001 20:55:09 +0200

Danny Yoo gave me the following example:

>class EmphasisGlancer(htmllib.HTMLParser):
>    def __init__(self):
>        htmllib.HTMLParser.__init__(self,
>                                    formatter.NullFormatter())
>        self.in_bold = 0
>        self.in_underline = 0
>    def start_b(self, attrs):
>        print "Hey, I see a bold tag!"
>        self.in_bold = 1
>    def end_b(self):
>        self.in_bold = 0
>    def start_u(self, attrs):
>        print "Hey, I see some underscored text!"
>        self.in_underline = 1
>    def end_u(self):
>        self.in_underline = 0
>    def start_blink(self, attrs):
>        print "Hey, this is some heinously blinking test... *grrrr*"
>    def handle_data(self, data):
>        if self.in_bold:
>             print "BOLD:", data
>        elif self.in_underline:
>             print "UNDERLINE:", data

Well, I've had more success than I would have imagined possible, but I'm still 
struggling with some stuff in this Sisyphean task. What I'm still having 
difficulty with:

1) Nested tags
    <br> and HTML entities cause difficulties, as they can be included with 
impunity inside other tags. I've been setting flags and collecting data, only 
to get tripped up by a <br> or an HTML entity, and seeing as I'm parsing German 
text there are a lot of those.
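
One way to keep a <br> or an entity from cutting collected text short is to
append every fragment to a buffer while the flag is set and to join the
fragments only when the closing tag arrives. What follows is a minimal sketch
under stated assumptions, not the htmllib code quoted above: it uses
html.parser.HTMLParser from Python 3 (where htmllib no longer exists), and the
class name BoldCollector and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class BoldCollector(HTMLParser):
    """Collect the text of each <b>...</b> element, tolerating <br> and entities."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &uuml; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.in_bold = False
        self.pieces = []      # fragments of the <b> element being read
        self.results = []     # finished strings, one per <b> element

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True
            self.pieces = []
        elif tag == "br" and self.in_bold:
            # Treat a line break inside the element as a single space.
            self.pieces.append(" ")

    def handle_endtag(self, tag):
        if tag == "b" and self.in_bold:
            self.in_bold = False
            self.results.append("".join(self.pieces))

    def handle_data(self, data):
        if self.in_bold:
            self.pieces.append(data)


parser = BoldCollector()
parser.feed("<p><b>Gr&uuml;&szlig;e<br>aus M&uuml;nchen</b> ignored</p>")
parser.close()
print(parser.results)
```

The point of the design is that handle_data() may be called several times for
one element, so nothing is treated as finished until the end tag is seen.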

2) Doing work only on specific attributes
   I've written little string searches to fast forward in a page and reduce 
the size of what has to be parsed. For the same reason I'd like to be able to 
stop parsing on a specific event.
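
A common way to stop parsing on a specific event is to raise an exception of
one's own from inside a handler and catch it around the call to feed(). The
sketch below assumes Python 3's html.parser rather than htmllib; StopParsing,
FirstHeading and first_heading are hypothetical names, and "stop after the
first <h1>" is just a stand-in for whatever event matters on the real page.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Raised from a handler to abandon the rest of the document."""


class FirstHeading(HTMLParser):
    """Grab the text of the first <h1> and then stop."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_h1 = False
        self.heading = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_data(self, data):
        if self.in_h1:
            self.heading += data

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            # No handler is called for anything after this point.
            raise StopParsing()


def first_heading(html):
    parser = FirstHeading()
    try:
        parser.feed(html)
        parser.close()
    except StopParsing:
        pass
    return parser.heading


print(first_heading("<h1>Titel</h1><p>long tail</p><h1>second</h1>"))
```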

    I've now got a particularly nasty webpage which distributes its relevant 
content in various blocks and triggering on simple anchors catches too much 
data. How do I go about this? The specific example would be checking the 
colour of a specific table cell:

<td height="40" bgcolor="eeeeff" width="50">

There don't seem to be any predefined methods for tables in htmllib, so do 
they all get handled as "unknown tag"? Would the thing to do be to define a 
start_td or a do_td? And what do the _bgn methods do? I ask because the 
example in the "Python Standard Library" works with "anchor_bgn" and not 
"do_a" or "start_a".
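
For orientation, the naming convention asked about comes from sgmllib, on
which htmllib is built: a start tag goes to start_<tag> if that method exists,
otherwise to do_<tag> (meant for tags with no closing tag, such as <br>), and
otherwise to unknown_starttag. anchor_bgn is not part of that dispatch; it is
a helper that htmllib's own start_a calls. The sketch below only imitates the
convention on top of Python 3's html.parser, so DispatchingParser and
TableWatcher are hypothetical names and not the sgmllib implementation.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Route tags to start_<tag>/do_<tag>/end_<tag> methods when they exist."""

    def handle_starttag(self, tag, attrs):
        method = getattr(self, "start_" + tag, None) or getattr(self, "do_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()
        else:
            self.unknown_endtag(tag)

    def unknown_starttag(self, tag, attrs):
        pass    # default: ignore tags nobody asked about

    def unknown_endtag(self, tag):
        pass


class TableWatcher(DispatchingParser):
    """Record which hook each tag ended up in."""

    def __init__(self):
        super().__init__()
        self.events = []

    def start_td(self, attrs):
        self.events.append(("start_td", dict(attrs)))

    def end_td(self):
        self.events.append(("end_td",))

    def do_br(self, attrs):
        self.events.append(("do_br",))

    def unknown_starttag(self, tag, attrs):
        self.events.append(("unknown", tag))


watcher = TableWatcher()
watcher.feed('<table><tr><td bgcolor="eeeeff">x<br>y</td></tr></table>')
watcher.close()
for event in watcher.events:
    print(event)
```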

I'm thinking along the lines of

self.text = 0      # flag for whether I need the text

def ....td(self, attrs):
    if self.bgcolor == "eeeeff":
        store data, nested_tags
    else:
        fast_forward(next_td)
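
The idea above can be sketched as follows. This is an illustration under
assumptions rather than a tested htmllib solution: it uses Python 3's
html.parser, the names ColouredCells, wanted and cells are invented, and
"fast forwarding" is approximated by simply ignoring data until the next <td>
begins. Nested tables are not handled.

```python
from html.parser import HTMLParser


class ColouredCells(HTMLParser):
    """Collect the text of <td> cells whose bgcolor matches a wanted value."""

    def __init__(self, wanted="eeeeff"):
        super().__init__(convert_charrefs=True)
        self.wanted = wanted
        self.collecting = False   # flag for whether the text is needed
        self.pieces = []
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # attrs is a list of (name, value) pairs with lower-cased names.
            self.collecting = dict(attrs).get("bgcolor") == self.wanted
            self.pieces = []
        elif tag == "br" and self.collecting:
            self.pieces.append("\n")
        # Other nested tags, such as <b>, are passed over; their text still
        # arrives through handle_data().

    def handle_endtag(self, tag):
        if tag == "td" and self.collecting:
            self.cells.append("".join(self.pieces).strip())
            self.collecting = False

    def handle_data(self, data):
        if self.collecting:
            self.pieces.append(data)


page = (
    '<table><tr>'
    '<td height="40" bgcolor="ffffff" width="50">skip me</td>'
    '<td height="40" bgcolor="eeeeff" width="50">keep<br><b>this</b></td>'
    '</tr></table>'
)
parser = ColouredCells()
parser.feed(page)
parser.close()
print(parser.cells)
```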

many thanx,