web page text extractor
miki.tebeka at gmail.com
Thu Jul 12 15:48:06 CEST 2007
> For a project, I need to develop a corpus of online news stories. I'm
> looking for an application that, given the url of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
Going simple :)
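from os import system
from sys import argv
OUTFILE = "geturl.txt"
# argv[1] is the URL given on the command line (argv is the whole list).
# The URL is quoted so the shell does not interpret characters such as &.
system("lynx -dump \"%s\" > %s" % (argv[1], OUTFILE))
# "start" is a Windows shell command; it opens the file in Notepad.
system("start notepad %s" % OUTFILE)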
(You can find lynx at http://lynx.browser.org/)
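
The same idea can also be sketched with subprocess, which hands the URL to lynx as a single argument so no shell quoting is needed. This is only a sketch: it assumes lynx is on the PATH and that Notepad is available (Windows), and the helper names dump_command, fetch_text and main are made up for illustration. The -nolist option, which drops the numbered list of links lynx appends to the dump, is optional.

```python
import subprocess
import sys

OUTFILE = "geturl.txt"


def dump_command(url):
    # Hypothetical helper: build the lynx argument list (no shell involved).
    return ["lynx", "-dump", "-nolist", url]


def fetch_text(url, outfile=OUTFILE):
    # Run lynx and write the rendered page text to outfile.
    with open(outfile, "w") as out:
        subprocess.run(dump_command(url), stdout=out, check=True)


def main(argv):
    if len(argv) != 2:
        print("usage: geturl.py URL")
        return 1
    fetch_text(argv[1])
    # Windows only: open the result in Notepad for review.
    subprocess.run(["notepad", OUTFILE])
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

With check=True, a failing lynx run raises CalledProcessError instead of silently leaving an empty file.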
Note that removing sidebars is a very difficult problem.
Search for "wrapper induction" to see some work on the subject.
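
For pages that happen to mark their navigation and sidebars with dedicated elements, a much cruder approach than wrapper induction may be good enough. The sketch below uses only the standard library and simply ignores text nested inside a fixed set of tags. The tag list, the class name StoryText and the function story_text are assumptions chosen for illustration; pages that build sidebars out of plain div or table elements will not be handled.

```python
from html.parser import HTMLParser

# Assumption: text inside these elements is not part of the story.
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer"}


class StoryText(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of skipped elements we are currently inside
        self.chunks = []  # text pieces kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.chunks.append(text)


def story_text(html):
    # Hypothetical helper: return the kept text, one piece per line.
    parser = StoryText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling story_text() on the HTML source of a page returns the remaining text, which can then be written to a file and opened in an editor as before.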
Miki <miki.tebeka at gmail.com>