web page text extractor

kublai restycena at gmail.com
Thu Jul 12 12:09:43 EDT 2007


On Jul 12, 10:22 pm, Jon Rosebaugh <j... at turnthepage.org> wrote:
> On 2007-07-12 04:42:25 -0500, kublai <restyc... at gmail.com> said:
>
> > For a project, I need to develop a corpus of online news stories.  I'm
> > looking for an application that, given the url of a web page, "copies"
> > the rendered text of the web page (not the source HTML text), opens a
> > text editor (Notepad), and displays the copied text for the user to
> > examine and save into a text file. Graphics and sidebars should be
> > ignored. The examples I have come across are much too complex for me
> > to customize for this simple job. Can anyone point me in the right
> > direction?
>
> You may find BeautifulSoup or templatemaker to be of assistance:
>
> http://www.crummy.com/software/BeautifulSoup/
> http://www.holovaty.com/blog/archive/2007/07/06/0128
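
For readers following the thread: the libraries suggested above are third-party packages. As a rough stand-in for the same idea (it is not BeautifulSoup or templatemaker), the sketch below uses only Python's standard-library html.parser to pull the visible text out of an HTML string while skipping script and style contents. The names VisibleTextParser and html_to_text are made up for illustration, and nothing here removes sidebars; that would need page-specific rules.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling html_to_text on a downloaded page (for example, one read with urllib.request.urlopen) gives plain text that can then be written to a file for review.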

Thanks, all, for your suggestions. I will try the Lynx solution first.
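
For anyone reading the archive, here is a minimal sketch of how the Lynx approach might be scripted from Python, assuming the lynx executable is installed and on the PATH and that Notepad is the editor, as in the original question. The function names are hypothetical; -dump and -nolist are standard Lynx options that print the formatted page to standard output and suppress the trailing list of links.

```python
import os
import subprocess
import tempfile


def lynx_dump_command(url):
    """Build the argument list for a plain-text dump of the page."""
    # -dump prints the rendered page to stdout; -nolist omits the link list.
    return ["lynx", "-dump", "-nolist", url]


def save_text(text, directory=None):
    """Write text to a new .txt file and return the file's path."""
    fd, path = tempfile.mkstemp(suffix=".txt", dir=directory, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


def dump_and_review(url, editor="notepad"):
    """Dump the page with lynx, save it, and open it in an editor.

    Assumes both lynx and the editor can be found on the PATH.
    """
    result = subprocess.run(
        lynx_dump_command(url), capture_output=True, text=True, check=True
    )
    path = save_text(result.stdout)
    subprocess.run([editor, path])  # returns once the editor is closed
    return path
```

Running dump_and_review with a news story's URL would open the dumped text in Notepad, where it can be examined and saved under a proper name. Lynx renders the whole page, so navigation and sidebar text may still need trimming by hand.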

Cheers,
gk



