[Tutor] How do I get text from an HTML document.
SA
sarmstrong13@mac.com
Wed, 14 Aug 2002 16:08:14 -0500
On 8/14/02 3:54 PM, "Magnus Lycka" <magnus@thinkware.se> wrote:
> Well, the following change would make your code a bit more
> general. Perhaps you should even skip the default parameter
> value and always supply start and end patterns.
>
> def getTextFromHTML(html, start="<!--Storytext-->", end="<!--/Storytext-->"):
>     ...
>     story = '(?sx)%s.+%s' % (start, end) # You don't need a raw string do you?
>     ...
>
> Some day you want to extract text in the <head> or in the <body> etc.
>
> Another option could be to write it as a class with some methods. (Untested.)
>
> class Html:
>     def __init__(self, html):
>         self.html = html
>     def crop(self, start, end):
>         story = '(?sx)%s.+%s' % (start, end)
>         self.html = re.findall(story, self.html)
>     def text(self):
>         data = StringIO.StringIO()
>         fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
>         parser = htmllib.HTMLParser(fmt)
>         parser.feed("".join(self.html))
>         return data.getvalue()
>
> h = Html(html)
> h.crop('<!--Storytext-->','<!--/Storytext-->')
> print h.text()
>
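For anyone trying the class idea above on Python 3: htmllib and formatter
are no longer in the standard library there, so the quoted code will not run
as written. Below is a rough sketch of the same crop-then-extract approach
using re and html.parser instead. The helper class _TextCollector, the sample
page string, and the use of re.escape with a non-greedy match are assumptions
added for illustration, not part of the quoted code. Unlike the
formatter-based version, it does no line wrapping and simply concatenates the
text nodes.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes, ignoring tags and comments."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for each run of plain text between tags.
        self.chunks.append(data)


class Html:
    def __init__(self, html):
        self.html = html

    def crop(self, start, end):
        # re.escape keeps any regex metacharacters in the markers literal;
        # the non-greedy (.+?) stops at the first end marker, and re.DOTALL
        # lets the match span multiple lines.
        pattern = re.compile(
            re.escape(start) + r"(.+?)" + re.escape(end), re.DOTALL
        )
        # Keep only the text between each start/end pair.
        self.html = "".join(pattern.findall(self.html))

    def text(self):
        collector = _TextCollector()
        collector.feed(self.html)
        collector.close()  # flush any text still buffered by the parser
        return "".join(collector.chunks)


# Made-up sample document, just to exercise the class.
page = (
    "<html><body><p>skip</p>"
    "<!--Storytext--><p>Hello, <b>world</b>!</p><!--/Storytext-->"
    "</body></html>"
)
h = Html(page)
h.crop("<!--Storytext-->", "<!--/Storytext-->")
print(h.text())
```

Because crop() takes the markers as arguments, the same object could be
cropped with other start and end strings, which is the generality the
suggestion above is after.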
Ahh...(he says as the light goes on above his head)
Thank you. I had not even thought that far ahead yet. That would definitely
be beneficial.
Thanks.
SA
--
"I can do everything on my Mac I used to on my PC. Plus a lot more ..."
-Me