[Tutor] Extract main text from HTML document
mats at wichmann.us
Sat May 5 17:39:37 EDT 2018
On 05/05/2018 11:59 AM, Simon Connah wrote:
> I'm writing a very simple web scraper. It'll download a page from a
> website and then store the result in a database of some sort. The
> problem is that this will obviously include a whole heap of HTML,
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> The title is obviously easy but the main body of text could contain
> all sorts of HTML and I'm interested to know how I might go about
> removing the bits that are not needed but still keep the meaning of
> the document intact.
> Does anyone have any suggestions on this front at all?
There's so much prior art in this space that it's not really worth
reinventing, unless you're using it as an exercise to teach
yourself more Python (always a worthy goal!).
Here's one guy's summary of _some_ of the existing practice, albeit
probably the best known.
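
If you do want to try it as an exercise, here is a minimal sketch of the
simplest piece of the job: stripping tags and dropping script/style content
using only the standard library's html.parser module. The names
TextExtractor, SKIP_TAGS and extract_text are made up for illustration, and
this does not attempt to find the "main body" of a page (skipping navigation,
sidebars, ads and so on), which is the hard part the existing tools deal with.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of non-visible tags.

    Hypothetical helper for illustration; not taken from any library.
    """

    # Tags whose contents are never part of the readable text (an assumption;
    # extend or shrink this set to suit the pages being scraped).
    SKIP_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped tags.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def get_text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.get_text()
```

Calling extract_text() on the downloaded page source gives you one long
string of text. It will still include menus, footers and the like, so for
anything beyond a learning exercise one of the existing libraries is likely
the better choice.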