[Tutor] Help with Parsing HTML files [html/OOP]
Charlie Clark
Charlie Clark <charlie@begeistert.org>
Fri, 10 Aug 2001 20:55:09 +0200
Danny Yoo gave me the following example:
>class EmphasisGlancer(htmllib.HTMLParser):
>    def __init__(self):
>        htmllib.HTMLParser.__init__(self,
>                                    formatter.NullFormatter())
>        self.in_bold = 0
>        self.in_underline = 0
>
>    def start_b(self, attrs):
>        print "Hey, I see a bold tag!"
>        self.in_bold = 1
>
>    def end_b(self):
>        self.in_bold = 0
>
>    def start_u(self, attrs):
>        print "Hey, I see some underscored text!"
>        self.in_underline = 1
>
>    def end_u(self):
>        self.in_underline = 0
>
>
>    def start_blink(self, attrs):
>        print "Hey, this is some heinously blinking test... *grrrr*"
>
>
>    def handle_data(self, data):
>        if self.in_bold:
>            print "BOLD:", data
>        elif self.in_underline:
>            print "UNDERLINE:", data
>###
Well, I've had more success than I would have imagined possible, but I'm still
struggling with some parts of this Sisyphean task. What I'm still having
difficulty with:
1) Nested tags
<br> and HTML entities cause difficulties because they can be included with
impunity inside other tags. I've been setting flags and collecting data, only
to get tripped up by a <br> or an HTML entity, and since I'm parsing German
text there are a lot of those.
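One way to keep a <br> or an entity from interrupting a collection is to
gather the text in pieces and only join them when the closing tag arrives.
The following is a minimal sketch of that idea, not the htmllib code from the
quoted example: it assumes Python 3's html.parser module as a stand-in, and
the class name BoldCollector and its attributes are made up for illustration.

```python
from html.parser import HTMLParser


class BoldCollector(HTMLParser):
    """Collect the text of each <b>...</b> element as one string."""

    def __init__(self):
        # convert_charrefs=True turns &uuml; and friends into characters
        # before handle_data() sees them.
        super().__init__(convert_charrefs=True)
        self.in_bold = 0      # same flag idea as in the quoted example
        self.chunks = []      # text pieces seen since the opening <b>
        self.results = []     # one finished string per <b> element

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = 1
            self.chunks = []
        elif tag == "br" and self.in_bold:
            # A <br> inside the element becomes a newline instead of
            # ending the collection.
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag == "b" and self.in_bold:
            self.in_bold = 0
            self.results.append("".join(self.chunks))

    def handle_data(self, data):
        # May be called several times per element, so only accumulate here.
        if self.in_bold:
            self.chunks.append(data)


parser = BoldCollector()
parser.feed("<p>x <b>Gr&uuml;&szlig;e<br>aus K&ouml;ln</b> y</p>")
parser.close()
print(parser.results)
```

The point of the design is that handle_data never decides when a piece of
text is complete; only the end tag does, so anything arriving in between
just adds to the list.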
2) Doing work only on specific attributes
I've written little string searches to fast forward in a page and reduce
the size of what has to be parsed. For the same reason I'd like to be able to
stop parsing on a specific event.
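Stopping on a specific event can be sketched by raising an exception from a
handler and catching it around feed(). This is only an illustration under
assumptions: it uses Python 3's html.parser rather than htmllib, and
StopParsing, FirstHeading and first_heading are hypothetical names, with the
first </h1> standing in for whatever event is of interest.

```python
from html.parser import HTMLParser


class StopParsing(Exception):
    """Hypothetical signal used to abandon feed() early."""


class FirstHeading(HTMLParser):
    """Collect the text of the first <h1> and then give up on the page."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_data(self, data):
        if self.in_h1:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            # The event we were waiting for: nothing after this matters.
            raise StopParsing


def first_heading(page):
    parser = FirstHeading()
    try:
        parser.feed(page)
    except StopParsing:
        pass  # expected: the parser stopped itself
    return "".join(parser.text)


title = first_heading("<h1>Nachrichten</h1><p>ignored</p><h1>second</h1>")
print(title)
```

The parser object should be thrown away after the exception, since its
internal buffer still holds the unparsed remainder of the page.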
I've now got a particularly nasty webpage which distributes its relevant
content across various blocks, and triggering on simple anchors catches too much
data. How do I go about this? The specific example would be checking the
colour of a specific table cell:
<td height="40" bgcolor="eeeeff" width="50">
There don't seem to be predefined methods for tables in htmllib, so do they
all get handled as "unknown tag"? Would the thing to do be to define a
start_td or a do_td? And what do the _bgn methods do? The reason I ask is
that the example in the "Python Standard Library" works with "anchor_bgn"
and not "do_a" or "start_a".
I'm thinking along the lines of:
    self.text = 0  # flag for whether I need the text
    def ....td(self, attrs):
        if self.bgcolor == "eeeeff":
            store data, nested_tags
        else:
            fast_forward(next_td)
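The outline above could be fleshed out roughly as follows. This is a hedged
sketch, not an htmllib answer: it assumes Python 3's html.parser, where every
tag arrives in handle_starttag with attrs as a list of (name, value) pairs,
and the class name ColouredCells and the sample page are invented for the
example. Instead of fast-forwarding, cells with the wrong colour are simply
ignored because the flag stays at 0.

```python
from html.parser import HTMLParser


class ColouredCells(HTMLParser):
    """Keep the text of <td> cells whose bgcolor matches a wanted value."""

    def __init__(self, wanted="eeeeff"):
        super().__init__()
        self.wanted = wanted
        self.text = 0         # flag for whether I need the text
        self.chunks = []      # pieces of the current cell
        self.cells = []       # finished cells

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Look the attribute up by name; a missing bgcolor gives None.
            self.text = 1 if dict(attrs).get("bgcolor") == self.wanted else 0
            self.chunks = []
        elif tag == "br" and self.text:
            self.chunks.append("\n")
        # Any other nested tag (<b>, <font>, ...) is ignored; its text
        # still reaches handle_data().

    def handle_endtag(self, tag):
        if tag == "td" and self.text:
            self.cells.append("".join(self.chunks).strip())
            self.text = 0

    def handle_data(self, data):
        if self.text:
            self.chunks.append(data)


page = """
<table><tr>
<td height="40" bgcolor="ffffff" width="50">skip me</td>
<td height="40" bgcolor="eeeeff" width="50">M&uuml;nchen<br><b>heute</b></td>
</tr></table>
"""
parser = ColouredCells()
parser.feed(page)
parser.close()
print(parser.cells)
```

The sketch does not handle tables nested inside a matching cell; an inner
<td> would reset the flag, so a depth counter would be needed for that case.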
many thanx,
Charlie