[Tutor] How do I get text from an HTML document.

Wed, 14 Aug 2002 14:42:17 -0500

I reworked your code with some re stuff I saw elsewhere and got the
following to work (this is the code used to manipulate the HTML held in a
variable called html):

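import re, string, StringIO
import formatter, htmllib

def getTextFromHTML(html):
    # Collect the formatted plain text in an in-memory buffer.
    data = StringIO.StringIO()
    # (?s) lets "." span newlines; ".+?" stops at the first closing marker.
    story = r'''(?sx)<!--Storytext-->.+?<!--/Storytext-->'''
    text = re.findall(story, html)
    text2 = string.join(text)
    fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
    parser = htmllib.HTMLParser(fmt)
    parser.feed(text2)
    parser.close()  # flush any text still buffered in the parser
    return data.getvalue()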

final = getTextFromHTML(html)

This works perfectly. Thanks for your help. If anyone sees a way to optimize
this code, please let me know.
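
For anyone trying this on a current Python 3, where htmllib and formatter are
no longer in the standard library, the same idea (pull out the marked block,
then keep only the text) might be sketched roughly as below. This is an
assumption-laden sketch, not a drop-in replacement: TextCollector and
get_text_from_html are hypothetical names, and unlike DumbWriter it does no
line wrapping or whitespace cleanup.

```python
import re
from html.parser import HTMLParser

# Same marker comments as the pattern above; ".+?" is non-greedy and the
# parentheses capture only the text between the markers.
STORY = re.compile(r"(?s)<!--Storytext-->(.+?)<!--/Storytext-->")


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def get_text_from_html(html):
    parser = TextCollector()
    for block in STORY.findall(html):
        parser.feed(block)
    parser.close()  # flush anything still buffered
    return "".join(parser.chunks)
```

If the markers are missing, findall() returns an empty list and the function
returns an empty string.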

Thanks Again.
SA

-- 
"I can do everything on my Mac I used to on my PC. Plus a lot more ..."
-Me