[Tutor] How do I get text from an HTML document.

Wed, 14 Aug 2002 16:08:14 -0500

On 8/14/02 3:54 PM, "Magnus Lycka" <magnus@thinkware.se> wrote:
> Well, the following change would make your code a bit more
> general. Perhaps you should even skip the default parameter
> value and always supply start and end patterns.
> 
> def getTextFromHTML(html, start="<!--Storytext-->", end="<!--/Storytext-->"):
>    ...
>    story = '(?sx)%s.+%s' % (start, end) # You don't need a raw string do you?
>    ...
> 
> Some day you want to extract text in the <head> or in the <body> etc.
> 
> Another option could be to write it as a class with some methods. (Untested.)
> 
> import re, StringIO, formatter, htmllib
> 
> class Html:
>    def __init__(self, html):
>        self.html = html
>    def crop(self, start, end):
>        # Keep only the parts between the start and end patterns.
>        story = '(?sx)%s.+%s' % (start, end)
>        self.html = re.findall(story, self.html)  # a list of matches
>    def text(self):
>        # Render the (possibly cropped) HTML as plain text.
>        data = StringIO.StringIO()
>        fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
>        parser = htmllib.HTMLParser(fmt)
>        parser.feed("".join(self.html))
>        return data.getvalue()
> 
> h = Html(html)
> h.crop('<!--Storytext-->','<!--/Storytext-->')
> print h.text()
> 
Ahh...(he says as the light goes on above his head)

Thank you. I had not even thought that far ahead yet. That would definitely
be beneficial.
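
For the archives, here is a minimal sketch of how the class idea might look
with a couple of details spelled out. Everything that differs from the quoted
code is an assumption on my part, not something from Magnus's post: the
markers are passed through re.escape so they are matched literally, `.+` is
changed to the non-greedy `.+?` so a match stops at the first end marker, and
the HTMLParser module stands in for htmllib and formatter. The names
_TextCollector and sample are made up for the illustration.

```python
import re

try:
    from html.parser import HTMLParser   # newer Pythons
except ImportError:
    from HTMLParser import HTMLParser    # older Pythons


class _TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text found between tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


class Html:
    def __init__(self, html):
        self.html = html

    def crop(self, start, end):
        # re.escape keeps the markers literal; (.+?) captures only what
        # lies between them and stops at the first end marker.
        pattern = "(?s)%s(.+?)%s" % (re.escape(start), re.escape(end))
        self.html = "".join(re.findall(pattern, self.html))

    def text(self):
        parser = _TextCollector()
        parser.feed(self.html)
        parser.close()
        return "".join(parser.chunks)


# Made-up sample page, only for the illustration.
sample = ("<html><body>menu<!--Storytext--><p>Hello, <b>world</b>!</p>"
          "<!--/Storytext-->footer</body></html>")
h = Html(sample)
h.crop("<!--Storytext-->", "<!--/Storytext-->")
print(h.text())
```

Because crop() joins the matches back into one string, it can be called more
than once to narrow the document further before asking for the text.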

Thanks.

SA

-- 
"I can do everything on my Mac I used to on my PC. Plus a lot more ..."
-Me