[Tutor] Help with Parsing HTML files [html/OOP]
Charlie Clark
Charlie Clark <charlie@begeistert.org>
Tue, 07 Aug 2001 17:45:03 +0200
>class EmphasisGlancer(htmllib.HTMLParser):
>    def __init__(self):
>        htmllib.HTMLParser.__init__(self,
>                                    formatter.NullFormatter())
>        self.in_bold = 0
>        self.in_underline = 0
>    def start_b(self, attrs):
>        print "Hey, I see a bold tag!"
>        self.in_bold = 1
>    def end_b(self):
>        self.in_bold = 0
>    def start_u(self, attrs):
>        print "Hey, I see some underscored text!"
>        self.in_underline = 1
>    def end_u(self):
>        self.in_underline = 0
>    def start_blink(self, attrs):
>        print "Hey, this is some heinously blinking test... *grrrr*"
>    def handle_data(self, data):
>        if self.in_bold:
>            print "BOLD:", data
>        elif self.in_underline:
>            print "UNDERLINE:", data
>I have not tested this code yet, but hopefully I haven't made too many
>typos. If you play around with it, you might find that sgmllib/htmllib
>isn't as bad as you think.
Getting there but my head is really hurting. I've been to the bookshop and
picked up Fredrik Lundh's book and tried to make sense of that :-(
I still don't understand when to use sgmllib and when to use htmllib. I don't
know whether they are good or bad at the moment.
I made my own sample parser based on this little snippet. As I understand it,
start_<tag> is called when a tag opens and can be set to do something; I guess
do_<tag> can be used to apply some logic to a tag, e.g. to check whether its
attributes are okay; and handle_data does things with the data that is
enclosed by the tag.
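
A minimal sketch of the callback idea described above, under some assumptions: htmllib and sgmllib were removed in Python 3, so this uses the standard library's html.parser instead, where a single handle_starttag / handle_endtag pair replaces the per-tag start_<tag> / end_<tag> methods. The class name HandlerDemo and the input string are made up for illustration. (For what it's worth, the sgmllib documentation describes do_<tag> as the handler for tags that have no closing tag, such as <br>, rather than as an attribute check.)

```python
from html.parser import HTMLParser  # Python 3 stand-in for htmllib (assumption)


class HandlerDemo(HTMLParser):
    """Hypothetical demo class: records which callbacks fire, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # skip chunks that are only whitespace
        if data.strip():
            self.events.append(("data", data.strip()))


demo = HandlerDemo()
demo.feed("<b>Runde</b> Rennen<br>")
demo.close()
for event in demo.events:
    print(event)
```

Note that <br> produces a start event but no end event: the parser reports tags as it meets them and does not insist on a matching close.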
How do I deal with nested structures where the data I'm interested in goes
across several tags, some of which aren't properly closed in the source!?
Here's some sample source:
<font face="Arial" size = "2"><b>Runde</b> Rennen –<b>Zeit</b>
13:59:49:(MEZ)<b>Wetter</b>sonnig<br>Herzlich willkommen
<br><br><table border="0" cellspacing="0"><tr><td><img src="/images/
I thought I might be able to use the flag-setting method to indicate the
start of the article, which in this case is the font tag.
import htmllib, formatter, string

class TagFinder(htmllib.HTMLParser):
    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        # why is this necessary? what does it do?
        self.text = 0
        self.article = []   # collect an individual article
        self.articles = []  # list of all articles

    def start_font(self, attrs):
        self.text = 1

    # the font tag isn't closed so we'll reset the flag when the table starts
    def start_table(self, args):
        self.text = 0

    def handle_data(self, data):
        if self.text:
            self.article.append(string.lstrip(data))  # spaces cause problems
        elif self.article != []:
            # add to the list collection
            self.articles.append(" ".join(self.article))
            self.article = []

# content: read in from a file, cleaned up and searched to the beginning
c = TagFinder()
c.feed(content)  # hand the source to the parser
c.close()
open("out.txt", "w").write("\n-----\n".join(c.articles)) # nearly obfuscated

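
For anyone reading this on Python 3, where htmllib no longer exists, here is a rough sketch of the same flag idea using html.parser. It is an illustration under assumptions, not a drop-in replacement: ArticleFinder, _flush and the shortened sample string are made-up names and data, and unlike the class above it stores the article as soon as the table starts instead of waiting for the next piece of data.

```python
from html.parser import HTMLParser  # Python 3 stand-in for htmllib (assumption)


class ArticleFinder(HTMLParser):
    """Hypothetical Python 3 counterpart of the flag-based TagFinder idea."""

    def __init__(self):
        super().__init__()
        self.in_text = False   # True between <font ...> and <table ...>
        self.article = []      # pieces of the article being collected
        self.articles = []     # finished articles

    def _flush(self):
        # move the collected pieces into the list of finished articles
        if self.article:
            self.articles.append(" ".join(self.article))
            self.article = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self._flush()
            self.in_text = True
        elif tag == "table":
            # the font tag is never closed, so the table marks the end
            self.in_text = False
            self._flush()

    def handle_data(self, data):
        if self.in_text and data.strip():
            self.article.append(data.strip())

    def close(self):
        super().close()
        self._flush()  # keep an article that runs to the end of the input


# shortened, simplified version of the sample source above
sample = ('<font face="Arial" size="2"><b>Runde</b> Rennen<b>Zeit</b>\n'
          '13:59:49:(MEZ)<b>Wetter</b>sonnig<br>Herzlich willkommen\n'
          '<br><br><table border="0"><tr><td>skip me</td></tr></table>')

finder = ArticleFinder()
finder.feed(sample)
finder.close()
print(finder.articles)
```

Because the parser only reports tags as it encounters them, the missing </font> does no harm here; the flag is what decides where an article starts and stops.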
It seems to work, but it also runs slower than my previous version, not that
it really matters in this case.
I would very much appreciate any comments on this as I am sure it could be