[Tutor] Extract main text from HTML document
Mark Lawrence
breamoreboy at gmail.com
Sun May 6 12:32:38 EDT 2018
On 05/05/18 18:59, Simon Connah wrote:
> Hi,
>
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> JavaScript and maybe even some CSS. None of which is useful to me.
>
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.
>
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
>
> Does anyone have any suggestions on this front at all?
>
> Thanks for any help.
>
> Simon.
A combination of requests http://docs.python-requests.org/en/master/ and
Beautiful Soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/
should fit the bill: requests downloads the page, and Beautiful Soup
parses the HTML so you can pull out just the text. Both are installable
with pip and are regarded as best of breed.
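
As a rough sketch of how the two fit together (not a definitive recipe):
the function names extract_main_text and fetch_main_text are made up for
illustration, and the code assumes that "main text" simply means
everything visible inside <body> once script and style blocks are
dropped. Real pages usually need site-specific tweaks, such as picking
out a particular <div> or <article>.

```python
import requests
from bs4 import BeautifulSoup


def extract_main_text(html):
    """Return the visible body text of an HTML document as one string.

    Hypothetical helper: "main text" here is an assumption, namely all
    text under <body> with script/style/noscript content removed.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are code or styling, not prose.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Use <body> when the page has one, otherwise the whole document.
    root = soup.body or soup
    text = root.get_text(separator=" ")
    # Collapse runs of whitespace (including newlines) to single spaces.
    return " ".join(text.split())


def fetch_main_text(url):
    """Download a page with requests and return its extracted text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_main_text(response.text)
```

Calling fetch_main_text with the URL of the page you want gives you a
plain string that can go straight into the database. Flattening
everything to single spaces throws away paragraph breaks; if you need
those, pass separator="\n" to get_text and clean up the lines instead.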
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence