[Tutor] Extract main text from HTML document

Mark Lawrence breamoreboy at gmail.com
Sun May 6 12:32:38 EDT 2018

On 05/05/18 18:59, Simon Connah wrote:
> Hi,
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> JavaScript and maybe even some CSS. None of which is useful to me.
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
> Does anyone have any suggestions on this front at all?
> Thanks for any help.
> Simon.

A combination of Requests (http://docs.python-requests.org/en/master/) and
Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
should fit the bill: Requests downloads the page and Beautiful Soup parses
the HTML so you can pull out just the text. Both are installable with pip
and are widely regarded as best of breed.
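
As a rough illustration of how the two libraries fit together, here is a
minimal sketch, not a definitive implementation. The function names
extract_main_text and fetch_main_text are made up for this example, and the
rule "prefer <article> or <main>, otherwise fall back to <body>" is only an
assumption about where the main text lives; real sites often need their own
selectors.

```python
import requests
from bs4 import BeautifulSoup


def extract_main_text(html):
    """Return the visible text of an HTML document as one cleaned-up string."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents are code or styling, not prose.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Assumption: the main text sits in <article> or <main> when present;
    # otherwise use <body>, and as a last resort the whole document.
    root = soup.find("article") or soup.find("main") or soup.body or soup

    # get_text() concatenates the remaining text nodes; split()/join()
    # collapses runs of whitespace and newlines into single spaces.
    return " ".join(root.get_text(separator=" ").split())


def fetch_main_text(url):
    """Download a page (hypothetical helper) and return its main text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_main_text(response.text)
```

Calling fetch_main_text() with the URL of the page you want should give you
a plain string that can go straight into your database. The title is
available separately as soup.title.string if you need it.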

My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence
