[Tutor] How do I get text from an HTML document.
Magnus Lycka
magnus@thinkware.se
Wed, 14 Aug 2002 22:54:06 +0200
At 14:42 2002-08-14 -0500, SA wrote:
>I reworked your code with some re stuff I saw elsewhere and got the
>following to work (this is the code that is used to manipulate html in a file
>called html):
>
>def getTextFromHTML(html):
>    data = StringIO.StringIO()
>    story = r'''(?sx)<!--Storytext-->.+<!--/Storytext-->'''
>    text = re.findall(story, html)
>    text2 = string.join(text)
>    fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
>    parser = htmllib.HTMLParser(fmt)
>    parser.feed(text2)
>    return data.getvalue()
>
>final = getTextFromHTML(html)
>
>This works perfect. Thanks for your help. If anyone sees any way to optimize
>this code, please let me know.
Well, the following change would make your code a bit more
general. Perhaps you should even skip the default parameter
value and always supply start and end patterns.
def getTextFromHTML(html, start="<!--Storytext-->", end="<!--/Storytext-->"):
    ...
    story = '(?sx)%s.+%s' % (start, end)  # You don't need a raw string, do you?
    ...
Some day you may want to extract the text in the <head> or in the <body>, etc.
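
One thing to watch once start and end become parameters: they are pasted into
the regular expression as-is, so a marker containing '.', '?', brackets or
(because of the x flag) plain spaces would no longer match literally. Here is
a small sketch of how that could be guarded against. It is a variation, not
the code from this thread: the name crop, the use of re.escape, the
non-greedy '.+?' and the capturing group (which drops the markers from the
result) are all my assumptions about what you might want.

```python
import re

def crop(html, start="<!--Storytext-->", end="<!--/Storytext-->"):
    """Return a list of the fragments found between start and end."""
    # re.escape makes the markers match literally, whatever they contain.
    # (?s) lets '.' match newlines; the x flag is dropped so that spaces
    # inside a marker still count.
    # '.+?' is non-greedy, so each start/end pair gives its own fragment.
    pattern = "(?s)%s(.+?)%s" % (re.escape(start), re.escape(end))
    return re.findall(pattern, html)

page = ("<p>menu</p><!--Storytext--><p>First</p><!--/Storytext-->"
        "<hr><!--Storytext--><p>Second</p><!--/Storytext-->")
fragments = crop(page)
```

With the greedy '.+' from the original, the two stories above would come back
as one match that also includes the <hr> between them. Whether that is what
you want depends on the pages you are processing.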
Another option could be to write it as a class with some methods. (Untested.)
class Html:
    def __init__(self, html):
        self.html = html
    def crop(self, start, end):
        story = '(?sx)%s.+%s' % (start, end)
        self.html = re.findall(story, self.html)
    def text(self):
        data = StringIO.StringIO()
        fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
        parser = htmllib.HTMLParser(fmt)
        parser.feed("".join(self.html))
        return data.getvalue()

h = Html(html)
h.crop('<!--Storytext-->','<!--/Storytext-->')
print h.text()
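
For anyone reading this on a current Python: htmllib and formatter were
removed in Python 3, so the class above will not run there. Below is a rough
Python 3 sketch of the same idea, swapping in html.parser.HTMLParser from the
standard library. The helper class _TextCollector is hypothetical, and the
output is not identical to DumbWriter's: the text nodes are simply
concatenated, with no line wrapping or paragraph breaks.

```python
import re
from html.parser import HTMLParser
from io import StringIO

class _TextCollector(HTMLParser):
    """Hypothetical helper: gathers text nodes, skipping tags and comments."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = StringIO()

    def handle_data(self, data):
        self.out.write(data)

class Html:
    def __init__(self, html):
        self.html = html

    def crop(self, start, end):
        # Escape the markers so they match literally; keep html a string.
        pattern = "(?s)%s.+?%s" % (re.escape(start), re.escape(end))
        self.html = "".join(re.findall(pattern, self.html))

    def text(self):
        collector = _TextCollector()
        collector.feed(self.html)
        collector.close()  # flush any text still buffered by the parser
        return collector.out.getvalue()

h = Html("<p>menu</p><!--Storytext--><p>Hello, <b>world</b>!</p>"
         "<!--/Storytext--><p>footer</p>")
h.crop("<!--Storytext-->", "<!--/Storytext-->")
print(h.text())
```

Note that the contents of <script> and <style> elements would also be
collected as text here; filtering those out is left as an exercise.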
-- 
Magnus Lyckå, Thinkware AB
Älvans väg 99, SE-907 50 UMEÅ
tel: 070-582 80 65, fax: 070-612 80 65
http://www.thinkware.se/ mailto:magnus@thinkware.se