[Tutor] Extract main text from HTML document
scopensource at gmail.com
Mon May 7 07:05:15 EDT 2018
That looks like a useful combination. Thanks.
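For anyone reading this in the archive, here is a minimal sketch of how requests and Beautiful Soup might be combined for this. It is only one way to do it, not a definitive recipe. The function names (extract_main_text, fetch_main_text), the list of tags to discard, and the idea of preferring an <article> or <main> element are all my own assumptions for illustration; real pages vary a lot and the heuristics will likely need tuning per site.

```python
import requests
from bs4 import BeautifulSoup

# Tags whose contents are rarely part of the readable text. This list is an
# assumption for illustration; adjust it for the sites being scraped.
NON_CONTENT_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form"]


def extract_main_text(html):
    """Return (title, text) for an HTML document, with the markup removed."""
    soup = BeautifulSoup(html, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else ""

    # Detach elements that usually hold navigation, scripts or styling.
    for tag in soup.find_all(NON_CONTENT_TAGS):
        tag.extract()

    # Heuristic (an assumption): prefer <article> or <main> when the page
    # has one, otherwise fall back to <body> or the whole document.
    container = soup.find("article") or soup.find("main") or soup.body or soup

    # Keep one paragraph per line and collapse runs of whitespace, so the
    # structure of the text survives without the tags.
    paragraphs = [" ".join(p.get_text().split()) for p in container.find_all("p")]
    text = "\n".join(p for p in paragraphs if p)
    return title, text


def fetch_main_text(url):
    """Download a page and extract its title and main text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return extract_main_text(response.text)
```

Calling fetch_main_text with a page URL returns the title and the paragraph text, which can then be stored in whatever database is chosen. Keeping the parsing in its own function means it can be tried on a saved HTML string without touching the network.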
On 6 May 2018 at 17:32, Mark Lawrence <breamoreboy at gmail.com> wrote:
> On 05/05/18 18:59, Simon Connah wrote:
>> I'm writing a very simple web scraper. It'll download a page from a
>> website and then store the result in a database of some sort. The
>> problem is that this will obviously include a whole heap of HTML,
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> The title is obviously easy but the main body of text could contain
>> all sorts of HTML and I'm interested to know how I might go about
>> removing the bits that are not needed but still keep the meaning of
>> the document intact.
>> Does anyone have any suggestions on this front at all?
>> Thanks for any help.
> A combination of requests http://docs.python-requests.org/en/master/ and
> beautiful soup https://www.crummy.com/software/BeautifulSoup/bs4/doc/ should
> fit the bill. Both are installable with pip and are regarded as best of
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.
> Mark Lawrence