[Tutor] How do I get text from an HTML document.
Magnus Lycka
magnus@thinkware.se
Wed, 14 Aug 2002 22:39:45 +0200
At 14:18 2002-08-14 -0500, SA wrote:
>It does a great job of extracting the text between the two tags. But for
>some reason I have a lot of extraneous material after the text that was not
>between the two tags and looks like it may have come from the html code
>after the end tag. Did I miss something?
>
> >>> def getTextFromHTML(html, startPattern, endPattern):
>...     data = StringIO.StringIO()
>...     start = html.find(startPattern)
>...     stop = html.find(endPattern, start + 1)
>...     fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
>...     parser = htmllib.HTMLParser(fmt)
>...     parser.feed(html[start:stop])
>...     return data
That suggests that "stop = html.find(endPattern, start + 1)"
didn't work as intended? Does stop come out as -1?
>>> def getTextFromHTML(html, startPattern, endPattern):
...     data = StringIO.StringIO()
...     start = html.find(startPattern)
...     stop = html.find(endPattern, start + 1)
Here we could put in
    print stop
    print html[start:stop]
to see what that looks like...
...     fmt = formatter.AbstractFormatter(formatter.DumbWriter(data))
...     parser = htmllib.HTMLParser(fmt)
...     parser.feed(html[start:stop])
...     return data
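To illustrate why a stop of -1 would give exactly this symptom, here is
a small sketch. The sample page and the marker names are made up for the
example; only str.find and slicing are used, so it does not depend on
htmllib.

```python
# Hypothetical sample page; the marker names are invented for illustration.
html = "<html><!--Story-->Hello world<!--Story--><p>footer junk</p></html>"

start = html.find("<!--Story-->")
# This end marker does not occur in the page, so find() reports -1.
stop = html.find("<!--/Story-->", start + 1)

print(stop)
# With stop == -1 the slice runs from the start marker to the last
# character but one, so everything after the story comes along too.
print(html[start:stop])
```

If stop is -1, the parser is fed nearly the whole rest of the page, which
would explain the extra material after the wanted text.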
Hm, I think I see what it was: did you supply "<!--/Storytext-->"
as the "endPattern" rather than "<!--Storytext-->"? In your question you
had "<!--Story-->" both before and after, so I called the function
as "getTextFromHTML(html, tag, tag)", with the same tag for before
and after.
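One way to make this kind of mistake show up at once is to check both
find() results before slicing. This is only a sketch: slice_between is a
hypothetical helper, not part of the function above, and it searches for
the end marker after the whole start marker (start + len(start_pattern))
instead of start + 1, which is my own variation.

```python
def slice_between(html, start_pattern, end_pattern):
    """Return html from start_pattern up to (not including) end_pattern."""
    start = html.find(start_pattern)
    if start == -1:
        raise ValueError("start pattern not found: %r" % start_pattern)
    # Look for the end marker only after the complete start marker.
    stop = html.find(end_pattern, start + len(start_pattern))
    if stop == -1:
        raise ValueError("end pattern not found: %r" % end_pattern)
    return html[start:stop]


# Hypothetical page using the same marker before and after the story.
page = "header<!--Story-->The story text<!--Story-->footer"
print(slice_between(page, "<!--Story-->", "<!--Story-->"))
```

The returned string could then be passed to parser.feed() in place of
html[start:stop]. A wrong end pattern raises ValueError instead of
silently feeding the rest of the page to the parser.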
-- 
Magnus Lyckå, Thinkware AB
Älvans väg 99, SE-907 50 UMEÅ
tel: 070-582 80 65, fax: 070-612 80 65
http://www.thinkware.se/ mailto:magnus@thinkware.se