[Tutor] How do I get text from an HTML document.

Magnus Lycka magnus@thinkware.se
Wed, 14 Aug 2002 22:54:06 +0200


At 14:42 2002-08-14 -0500, SA wrote:
>I reworked your code with some re stuff I saw elsewhere and got the
>following to work(this is the code that is used to manipulate html in a file
>called html):
>
>def getTextFromHTML(html):
>     data = StringIO.StringIO()
>     story = r'''(?sx)<!--Storytext-->.+<!--/Storytext-->'''
>     text = re.findall(story, html)
>     text2 = string.join(text)
>     fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
>     parser = htmllib.HTMLParser(fmt)
>     parser.feed(text2)
>     return data.getvalue()
>
>final = getTextFromHTML(html)
>
>This works perfect. Thanks for your help. If anyone sees any way to optimize
>this code, please let me know.

Well, the following change would make your code a bit more
general. Perhaps you should even skip the default parameter
value and always supply start and end patterns.

def getTextFromHTML(html, start="<!--Storytext-->", end="<!--/Storytext-->"):
     ...
     story = '(?sx)%s.+%s' % (start, end) # You don't need a raw string do you?
     ...

Some day you may want to extract the text in the <head> or in the <body> instead.
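
One caveat once start and end are arbitrary strings: with the (?x) flag,
whitespace in the pattern is ignored and '#' starts a comment, and any regex
metacharacters in the markers are interpreted rather than matched literally.
A minimal sketch that sidesteps this, assuming the markers are meant as
literal text, is below. The name crop_between is hypothetical, and the
non-greedy '.+?' is a deliberate change from the '.+' above, which would run
from the first start marker to the last end marker.

```python
import re

def crop_between(html, start, end):
    """Return every stretch of html from a start marker to the next end marker.

    crop_between is a hypothetical helper, not part of the original code.
    """
    # re.escape makes the markers match literally, so characters such as
    # '.', '(' or '?' inside them are not treated as regex syntax.
    # (?s) lets '.' match newlines; '.+?' stops at the nearest end marker.
    pattern = "(?s)%s.+?%s" % (re.escape(start), re.escape(end))
    return re.findall(pattern, html)

sample = ("<p>skip</p><!--Storytext--><b>One</b><!--/Storytext--> x "
          "<!--Storytext-->Two<!--/Storytext-->")
for chunk in crop_between(sample, "<!--Storytext-->", "<!--/Storytext-->"):
    print(chunk)
```

If a page can only ever contain one marked section, the greedy and
non-greedy forms give the same result.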

Another option could be to write it as a class with some methods. (Untested.)

import re, StringIO, formatter, htmllib

class Html:
     def __init__(self, html):
         self.html = html
     def crop(self, start, end):
         story = '(?sx)%s.+%s' % (start, end)
         # Join the matches so self.html stays a string and crop() can be repeated.
         self.html = "".join(re.findall(story, self.html))
     def text(self):
         data = StringIO.StringIO()
         fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
         parser = htmllib.HTMLParser(fmt)
         parser.feed(self.html)
         return data.getvalue()

h = Html(html)
h.crop('<!--Storytext-->','<!--/Storytext-->')
print h.text()
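
For anyone reading this on Python 3: htmllib no longer exists there, and the
formatter module is gone from recent versions as well. A rough sketch of the
text-extraction step, assuming Python 3 and using html.parser.HTMLParser in
place of htmllib, might look like this. TextExtractor and get_text_from_html
are hypothetical names, and unlike DumbWriter this does no line wrapping or
paragraph formatting; it only collects the text between tags (including the
contents of any <script> or <style> elements).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text between tags (hypothetical helper for illustration)."""

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text outside of tags and comments.
        self.parts.append(data)

def get_text_from_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.parts)

print(get_text_from_html(
    "<!--Storytext--><p>Hello, <b>world</b> &amp; all.</p><!--/Storytext-->"))
```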


-- 
Magnus Lyckå, Thinkware AB
Älvans väg 99, SE-907 50 UMEÅ
tel: 070-582 80 65, fax: 070-612 80 65
http://www.thinkware.se/  mailto:magnus@thinkware.se