[Tutor] How do I get text from an HTML document.
SA
sarmstrong13@mac.com
Wed, 14 Aug 2002 14:42:17 -0500
I reworked your code with some re (regular expression) stuff I saw elsewhere
and got the following to work (this is the code used to manipulate the HTML
in a file called html):
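import re, string, StringIO, formatter, htmllib

def getTextFromHTML(html):
    data = StringIO.StringIO()
    # Grab everything between the story markers; (?s) lets . span newlines.
    story = r'''(?sx)<!--Storytext-->.+<!--/Storytext-->'''
    text = re.findall(story, html)
    text2 = string.join(text)
    # DumbWriter writes the parsed text, without tags, into data.
    fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
    parser = htmllib.HTMLParser(fmt)
    parser.feed(text2)
    parser.close()  # flush anything the parser is still buffering
    return data.getvalue()

final = getTextFromHTML(html)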
This works perfectly. Thanks for your help. If anyone sees a way to optimize
this code, please let me know.
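
One possible tightening, as a rough sketch only. It assumes Python 3, where
htmllib is gone, so it swaps in html.parser from the standard library.
TextCollector, get_text_from_html and the sample string are made-up names for
illustration, not part of the code above. It also assumes a page may hold
several marked sections, so the pattern is non-greedy (.+?) and captures only
the text between the markers.

```python
import re
from html.parser import HTMLParser

# Non-greedy match so each marked section is captured on its own;
# re.DOTALL lets "." span newlines, like (?s) in the original pattern.
STORY_RE = re.compile(r"<!--Storytext-->(.+?)<!--/Storytext-->", re.DOTALL)


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def get_text_from_html(html):
    # Pull out only the parts between the story markers.
    sections = STORY_RE.findall(html)
    parser = TextCollector()
    parser.feed(" ".join(sections))
    parser.close()  # flush any text still buffered in the parser
    # Collapse the runs of whitespace left behind where tags were.
    return " ".join("".join(parser.chunks).split())


# Made-up sample input, just to show the call.
sample = ("<html><!--Storytext--><p>Hello <b>world</b>.</p>"
          "<!--/Storytext--><p>ad</p></html>")
print(get_text_from_html(sample))
```

Unlike the DumbWriter version, this does no line wrapping or paragraph
breaks; it just returns the story text as a single whitespace-normalized
string.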
Thanks Again.
SA
--
"I can do everything on my Mac I used to on my PC. Plus a lot more ..."
-Me