question on HTMLParser and parser.feed()

Sat Dec 6 04:00:53 EST 2003

Stephen Briley wrote:

> I am satisfied with the HTMLparse of my htmlsource
> page.  But I am unable to save the output of

My guess is that in the long run you will be even more satisfied with the
HTMLParser in the HTMLParser module - it has a cleaner interface and can
handle XHTML.
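
A minimal sketch of what that could look like, assuming you only want the
text between the tags. TextCollector is a name I made up for illustration,
and the import fallback is there because the module is called html.parser
in Python 3:

```python
try:
    from html.parser import HTMLParser   # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 module name


class TextCollector(HTMLParser):
    """Hypothetical helper that gathers the text found between tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside of tags.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


collector = TextCollector()
collector.feed("<html><body>For <b>demonstration</b> purposes</body></html>")
collector.close()
result = collector.text()
```

Here the text ends up in an ordinary attribute of your own class, so there
is no formatter or writer to redirect.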

> parser.feed(htmlsource).  When I type
> parser.feed(htmlsource) into the interpreter, the
> correct output streams across the screen.  But all of
> my attempts to capture this output to a variable are
> unsuccessful (e.g. capt_text =
> parser.feed(htmlsource)).
> 
> What am I missing and how can I get this to work?
> Thanks in advance!
> 
> 
> from htmllib import HTMLParser
> from formatter import AbstractFormatter, DumbWriter
> parser = HTMLParser(AbstractFormatter(DumbWriter()))
> parser.feed(htmlsource)

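parser.feed() always returns None; the text you see comes from the
DumbWriter, which writes to sys.stdout unless told otherwise. You can pass a
file object to DumbWriter to send the formatter output to a file instead: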

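# open() is the usual way to create a file object
outstream = open("tmp.txt", "w")
parser = HTMLParser(AbstractFormatter(DumbWriter(outstream)))
parser.feed(htmlsource)
# closing the file makes sure everything is written to disk
outstream.close()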

If you don't want the output on disk, you can instead provide a StringIO
instance, which behaves like a file but keeps its contents in memory:

# cStringIO contains the faster version of StringIO
from cStringIO import StringIO
from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter

htmlsource = """
<html>
    <head><title>Hello world</title></head>
    <body>For demonstration purposes</body>
</html>
"""

outstream = StringIO()
parser = HTMLParser(AbstractFormatter(DumbWriter(outstream)))
parser.feed(htmlsource)
data = outstream.getvalue()
outstream.close()

# your code here, I just print it in uppercase
print data.upper()
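
The same trick should work for any code that writes to a file object, not
just DumbWriter. A small sketch of the idea, where write_lines is a
hypothetical stand-in for such code and the import fallback covers Python
versions that no longer ship cStringIO:

```python
try:
    from cStringIO import StringIO   # faster C version, Python 2 only
except ImportError:
    from io import StringIO          # Python 3


def write_lines(stream):
    # Hypothetical stand-in for any code that only knows how to
    # write to a file-like object.
    stream.write("Hello world\n")
    stream.write("For demonstration purposes\n")


outstream = StringIO()
write_lines(outstream)
# getvalue() must be called before close(), which discards the buffer
data = outstream.getvalue()
outstream.close()
```

After this, data is a plain string that you can process like any other.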

Peter