[Tutor] Extract main text from HTML document

Simon Connah scopensource at gmail.com
Sun May 6 06:47:25 EDT 2018


Thanks for the replies, everyone. Beautiful Soup looks like a good option.

My primary goal is to extract the main body text, the title, and the
meta description from a web page, then run them through one of the
cloud natural language processing services to pull out some
information I'm interested in. I'd like to do this for quite a few
websites.

This is all for a little project I have in mind. I'm not even sure
it'll work, but it'll be fun to try. I might have to do some custom
work on top of what Beautiful Soup offers, though, as I need to get
very specific data in a certain format.
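
For anyone following along, here is a rough sketch of the kind of
extraction I mean: title, meta description, and body text. It assumes
Beautiful Soup 4 (pip install beautifulsoup4) with the standard
library's "html.parser" backend. The names extract_page_info and
SAMPLE_HTML are made up for illustration, and "main body text" here
just means all the visible text under <body>, so on a real page it
would probably still include navigation and footer text.

```python
from bs4 import BeautifulSoup


def extract_page_info(html):
    """Return the title, meta description and visible body text of a page.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")

    # <title> may be missing, so guard against None.
    title = soup.title.get_text(strip=True) if soup.title else ""

    # <meta name="description" content="..."> is optional as well.
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "").strip() if meta else ""

    # Remove elements whose contents are never shown to the reader.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Fall back to the whole document if there is no <body>.
    body = soup.body or soup
    # Separate text fragments with spaces, then collapse runs of whitespace.
    text = " ".join(body.get_text(separator=" ").split())

    return {"title": title, "description": description, "text": text}


# Made-up sample page so the sketch runs without network access.
SAMPLE_HTML = """
<html>
  <head>
    <title>Example page</title>
    <meta name="description" content="A short summary.">
    <style>p { color: red; }</style>
  </head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph.</p>
    <script>console.log("ignored");</script>
  </body>
</html>
"""

info = extract_page_info(SAMPLE_HTML)
print(info)
```

For a live page, the HTML could come from something like
urllib.request.urlopen(url).read(), since BeautifulSoup accepts bytes as
well as str. Picking out only the "main" article text would need extra,
site-specific rules on top of this.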

On 5 May 2018 at 22:43, boB Stepp <robertvstepp at gmail.com> wrote:
> On Sat, May 5, 2018 at 12:59 PM, Simon Connah <scopensource at gmail.com> wrote:
>
>> I was wondering if there was a way in which I could download a web
>> page and then just extract the main body of text without all of the
>> HTML.
>
> I do not have any experience with this, but I like to collect books.
> One of them [1] says on page 245:
>
> "Beautiful Soup is a module for extracting information from an HTML
> page (and is much better for this purpose than regular expressions)."
>
> I believe this topic has come up before on this list as well as the
> main Python list.  You may want to check Beautiful Soup out; it can
> be installed with pip.
>
> [1] "Automate the Boring Stuff with Python -- Practical Programming
> for Total Beginners" by Al Sweigart.
>
> HTH!
> --
> boB

