[Tutor] Extract main text from HTML document
boB Stepp
robertvstepp at gmail.com
Sat May 5 17:43:01 EDT 2018
On Sat, May 5, 2018 at 12:59 PM, Simon Connah <scopensource at gmail.com> wrote:
> I was wondering if there was a way in which I could download a web
> page and then just extract the main body of text without all of the
> HTML.
I do not have any experience with this, but I like to collect books.
One of them [1] says on page 245:
"Beautiful Soup is a module for extracting information from an HTML
page (and is much better for this purpose than regular expressions)."
I believe this topic has come up before on this list as well as the
main Python list. You may want to check Beautiful Soup out. It can be
installed with pip (the package name is beautifulsoup4).
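
For what it is worth, here is a minimal sketch of how that might look.
It assumes Beautiful Soup 4 is installed and that "main text" simply
means all visible text inside <body>; real pages may need more
targeted selection. The function names extract_text and fetch_text are
made up for illustration and do not come from the book or the library.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def extract_text(html):
    """Return the visible text of an HTML document, one piece per line.

    Hypothetical helper: "visible text" is approximated as everything
    in <body> except script, style and noscript elements.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents are not displayed as page text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Use <body> when the document has one, otherwise the whole tree.
    root = soup.body or soup

    # strip=True trims each text fragment and skips empty ones.
    return root.get_text(separator="\n", strip=True)


def fetch_text(url):
    """Download a page and return its text (hypothetical helper)."""
    with urlopen(url) as response:
        html = response.read()  # bytes; Beautiful Soup detects the encoding
    return extract_text(html)
```

Calling fetch_text("https://example.com/") would download that page
and return its text. Some sites reject the default urllib user agent
or build their content with JavaScript, in which case this simple
approach will not be enough.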
[1] "Automate the Boring Stuff with Python -- Practical Programming
for Total Beginners" by Al Sweigart.
HTH!
--
boB