[Tutor] Extract main text from HTML document

Sat May 5 17:43:01 EDT 2018

On Sat, May 5, 2018 at 12:59 PM, Simon Connah <scopensource at gmail.com> wrote:

> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.

I do not have any experience with this, but I like to collect books.
One of them [1] says on page 245:

"Beautiful Soup is a module for extracting information from an HTML
page (and is much better for this purpose than regular expressions)."

I believe this topic has come up before on this list as well as the
main Python list.  You may want to check Beautiful Soup out.  It can
be installed with pip (the package is named beautifulsoup4).
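
For illustration, here is a minimal sketch of what pulling the text
out of a page might look like.  The names extract_text, SKIP_TAGS,
_TextCollector and the sample HTML string are made up for this
example and do not come from the book.  The sketch assumes Beautiful
Soup 4 is installed and, if it is not, falls back to the standard
library's html.parser so that it still runs.

```python
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None  # use the standard-library fallback below

# Tags whose contents are code or styling rather than readable text.
SKIP_TAGS = ["script", "style"]


class _TextCollector(HTMLParser):
    """Fallback parser: gather text that is not inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the text of an HTML string with whitespace collapsed."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        # Drop <script> and <style> elements before reading the text.
        for tag in soup.find_all(SKIP_TAGS):
            tag.decompose()
        raw = soup.get_text(separator=" ")
    else:
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        raw = " ".join(collector.chunks)
    # Collapse runs of spaces and newlines into single spaces.
    return " ".join(raw.split())


# Hypothetical sample page, used here instead of a real download.
sample = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>Some <b>main</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

To try this on a real page, the HTML string could come from
urllib.request.urlopen(url).read() decoded to text.  Note that this
returns all of the text on the page, including menus and footers;
narrowing it down to just the main body would likely need some
page-specific selection of elements.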

[1] "Automate the Boring Stuff with Python -- Practical Programming
for Total Beginners" by Al Sweigart.

HTH!
-- 
boB