Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.
wes weston
oweston at earthlink.net
Fri Jul 7 18:43:08 EDT 2006
import urllib
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []

    # HTMLParser passes text chunks to handle_data; it has no do_data hook.
    def handle_data(self, data):
        data = data.strip()
        if data:
            self.TokenList.append(data)
            #print data

    def GetTokenList(self):
        return self.TokenList

try:
    url = "http://....your url here.............."
    f = urllib.urlopen(url)
    res = f.read()
    f.close()
except IOError:
    # urllib.urlopen raises IOError when the page cannot be fetched
    print "bad read"
else:
    h = MyHTMLParser()
    h.feed(res)
    tokensList = h.GetTokenList()
Kenneth McDonald wrote:
> I'm writing a program that will parse HTML and (mostly) convert it to
> MediaWiki format. The two Python modules I'm aware of to do this are
> HTMLParser and htmllib. However, I'm currently experiencing either real
> or conceptual difficulty with both, and was wondering if I could get
> some advice.
>
> The problem I'm having with HTMLParser is simple; I don't seem to be
> getting the actual text in the HTML document. I've implemented the
> do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
> it never seems to receive any data. Is there another way to access the
> text chunks as they come along?
>
> HTMLParser would probably be the way to go if I can figure this out. It
> seems much simpler than htmllib, and satisfies my requirements.
>
> htmllib will write out the text data (using the AbstractFormatter and
> AbstractWriter), but my problem here is conceptual. I simply don't
> understand why all of these different "levels" of abstractness are
> necessary, nor how to use them. As an example, the HTML <i>text</i>
> should be converted to ''text'' (double single-quotes at each end) in my
> mediawiki markup output. This would obviously be easy to achieve if I
> simply had an HTML parser that called a method for each start tag, text
> chunk, and end tag. But htmllib calls the tag functions in HTMLParser,
> and then does more things with both a formatter and a writer. To me,
> both seem unnecessarily complex (though I suppose I can see the benefits
> of a writer before generators gave the opportunity to simply yield
> chunks of output to be processed by external code.) In any case, I don't
> really have a good idea of what I should do with htmllib to get my
> converted tags, and then content, and then closing converted tags,
> written out.
>
> Please feel free to point to examples, code, etc. Probably the simplest
> solution would be a way to process text content in HTMLParser.HTMLParser.
>
> Thanks,
> Ken
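
For the <i>text</i> to ''text'' case described above, HTMLParser already calls one method per start tag, text chunk, and end tag, so a small subclass may be enough. The following is only a rough sketch: the names WikiConverter, MARKUP and html_to_wiki are made up for illustration, the <b> entry is an assumption added here (only the <i> mapping comes from the message), and the import fallback is there so the same code also loads under Python 3.

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module name, as used in this thread
except ImportError:
    from html.parser import HTMLParser  # Python 3 location of the same class


class WikiConverter(HTMLParser):
    # Hypothetical tag-to-markup table. Only the "i" entry comes from the
    # original message; "b" is an assumed extra for illustration.
    MARKUP = {"i": "''", "b": "'''"}

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Emit the wiki markup for known tags; unknown tags emit nothing.
        self.chunks.append(self.MARKUP.get(tag, ""))

    def handle_endtag(self, tag):
        # These wiki markers use the same string to open and to close.
        self.chunks.append(self.MARKUP.get(tag, ""))

    def handle_data(self, data):
        # Text between tags is passed through unchanged.
        self.chunks.append(data)

    def result(self):
        return "".join(self.chunks)


def html_to_wiki(html):
    parser = WikiConverter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.result()
```

Calling html_to_wiki("some <i>text</i> here") should give back the string with ''text'' in place of the italic tags. Tags that are not in the table are simply dropped, which is probably too crude for links, lists and headings; those would need their own handling in handle_starttag and handle_endtag.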