Parsing HTML (continued)

Charlie Clark charlie at begeistert.org
Fri Aug 10 14:55:37 EDT 2001


Danny Yoo gave me the following example:

>class EmphasisGlancer(htmllib.HTMLParser):
>    def __init__(self):
>        htmllib.HTMLParser.__init__(self,
>                                    formatter.NullFormatter())
>        self.in_bold = 0
>        self.in_underline = 0
>
>    def start_b(self, attrs):
>        print "Hey, I see a bold tag!"
>        self.in_bold = 1
>
>    def end_b(self):
>        self.in_bold = 0
>
>    def start_u(self, attrs):
>        print "Hey, I see some underscored text!"
>        self.in_underline = 1
>
>    def end_u(self):
>        self.in_underline = 0
>
>
>    def start_blink(self, attrs):
>        print "Hey, this is some heinously blinking test... *grrrr*"
>
>
>    def handle_data(self, data):
>        if self.in_bold:
>            print "BOLD:", data
>        elif self.in_underline:
>            print "UNDERLINE:", data
>###
Well, I've had more success than I would have imagined possible, but I'm
still struggling with some parts of this Sisyphean task. What I'm still
having difficulty with:

1) Nested tags
    <br> and HTML entities cause difficulties because they can be included
with impunity inside other tags. I've been setting flags and collecting
data, only to get tripped up by a <br> or an HTML entity, and since I'm
parsing German text there are a lot of those.
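
One way the flag-and-collect approach might cope with <br> and entities is
to gather the pieces in a list and only join them when the closing tag
arrives. The sketch below is not htmllib code: it assumes Python 3's
html.parser.HTMLParser as a stand-in, and the names BoldCollector and
collect_bold are made up for illustration.

```python
from html.parser import HTMLParser


class BoldCollector(HTMLParser):
    """Collect the complete text of each <b>...</b> element.

    Assumption: html.parser (Python 3) stands in for htmllib here.
    """

    def __init__(self):
        # convert_charrefs=True turns entities such as &uuml; into
        # characters before handle_data sees them.
        super().__init__(convert_charrefs=True)
        self.in_bold = 0
        self.chunks = []    # pieces of text seen inside the current <b>
        self.results = []   # one joined string per finished <b> element

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = 1
            self.chunks = []
        elif tag == "br" and self.in_bold:
            # A <br> splits the data into two handle_data calls;
            # record it as a newline instead of losing the flag.
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag == "b" and self.in_bold:
            self.in_bold = 0
            self.results.append("".join(self.chunks))

    def handle_data(self, data):
        if self.in_bold:
            self.chunks.append(data)


def collect_bold(html):
    """Return the text of every <b> element found in html."""
    parser = BoldCollector()
    parser.feed(html)
    parser.close()
    return parser.results
```

The point of the design is that handle_data may be called several times
for one element, so nothing is treated as finished until the end tag shows
up.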

2) Doing work only on specific attributes
   I've written little string searches to fast forward in a page and
reduce the size of what has to be parsed. For the same reason I'd like
to be able to stop parsing on a specific event.
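
A possible way to stop on a specific event is to raise an exception from
inside a handler and catch it around the call to feed(). This is only a
sketch under assumptions: it uses Python 3's html.parser rather than
htmllib, and StopParsing, FirstHeadingParser and first_heading are
hypothetical names, with <h1> standing in for whatever the real trigger is.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Raised by a handler to abandon the rest of the page (made-up name)."""


class FirstHeadingParser(HTMLParser):
    """Collect the text of the first <h1>, then stop."""

    def __init__(self):
        super().__init__()
        self.in_heading = 0
        self.chunks = []
        self.tags_seen = 0   # only here to show that parsing really stops

    def handle_starttag(self, tag, attrs):
        self.tags_seen += 1
        if tag == "h1":
            self.in_heading = 1

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_heading:
            # The event we were waiting for: give up on the rest.
            raise StopParsing

    def handle_data(self, data):
        if self.in_heading:
            self.chunks.append(data)


def first_heading(html):
    """Return (heading text, number of start tags seen before stopping)."""
    parser = FirstHeadingParser()
    try:
        parser.feed(html)
        parser.close()
    except StopParsing:
        pass   # stopped early on purpose
    return "".join(parser.chunks), parser.tags_seen
```

Whatever was collected before the exception is still on the parser
object, so nothing is lost by bailing out early.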

    I've now got a particularly nasty webpage which distributes its
relevant content across various blocks, and triggering on simple anchors
catches too much data. How do I go about this? The specific example
would be checking the colour of a specific table cell:

<td height="40" bgcolor="eeeeff" width="50">

There don't seem to be predefined methods for tables in htmllib, so do
they all get handled as "unknown tag"? Would the thing to do be to define
a start_td or a do_td method? And what do the _bgn methods do? The reason I
ask is that the example in the "Python Standard Library" works with
"anchor_bgn" and not "do_a" or "start_a".

I'm thinking along the lines of

self.text = 0      # flag for whether I need the text

def ....td(self, attrs):
    if self.bgcolor == "eeeeff":
        store data, nested_tags
    else:
        fast_forward(next_td)
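
Fleshed out, the idea might look roughly like the sketch below. It rests
on several assumptions: it uses Python 3's html.parser (a single
handle_starttag instead of per-tag start_td methods) rather than htmllib,
it assumes the cells are not nested inside one another, and
ColouredCellParser and coloured_cells are invented names. Instead of
fast-forwarding to the next <td>, it simply ignores data while the flag is
off.

```python
from html.parser import HTMLParser


class ColouredCellParser(HTMLParser):
    """Keep the text of <td> cells whose bgcolor matches `wanted`.

    Assumption: html.parser stands in for htmllib; no nested tables.
    """

    def __init__(self, wanted="eeeeff"):
        super().__init__()
        self.wanted = wanted
        self.text = 0        # flag for whether I need the text
        self.chunks = []     # pieces of the current wanted cell
        self.cells = []      # finished cell texts

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # attrs is a list of (name, value) pairs; names are lowercased.
            bgcolor = dict(attrs).get("bgcolor") or ""
            self.text = bgcolor.lower() == self.wanted
            self.chunks = []
        elif tag == "br" and self.text:
            # Nested <br> inside a wanted cell becomes a space.
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag == "td" and self.text:
            self.cells.append("".join(self.chunks).strip())
            self.text = 0

    def handle_data(self, data):
        if self.text:
            self.chunks.append(data)


def coloured_cells(html, wanted="eeeeff"):
    """Return the text of every <td> whose bgcolor equals `wanted`."""
    parser = ColouredCellParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.cells
```

Turning attrs into a dict keeps the colour test to one line, and cells
with a different colour or no bgcolor at all just leave the flag off.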

many thanx,

Charlie


