Python equivalent of lynx -dump?
USENET at questionexchange.com
Fri Apr 14 15:42:28 CEST 2000
The server only sends the raw HTML. If you want it formatted, you need to
format it yourself, more or less. To retrieve the data from the server, you
can use urlopen from urllib. Alternatively, you could use httplib, but that
is generally only necessary if you're doing something fancy and
HTTP-specific.
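
For readers on Python 3, where urlopen lives in urllib.request rather than
urllib, the retrieval step might look roughly like the sketch below. The
helper name fetch_html and the default UTF-8 decoding are assumptions made
for illustration; they are not part of the original answer.

```python
from urllib.request import urlopen  # Python 3 location of urlopen


def fetch_html(url, encoding="utf-8"):
    """Return the document served at `url` as text (hypothetical helper)."""
    response = urlopen(url)
    try:
        raw = response.read()  # the raw, unformatted bytes sent by the server
    finally:
        response.close()
    # Assumption: the page is encoded as `encoding`; bytes that cannot be
    # decoded are replaced rather than raising an error.
    return raw.decode(encoding, "replace")
```

Calling fetch_html with an http:// address returns the page's HTML source as
a string, ready to be handed to a parser.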
Once you've got the HTML, you can use htmllib to do the parsing. Its
HTMLParser needs a "formatter", which in turn needs a "writer" (see the
documentation for the formatter module). The formatter module has an
AbstractFormatter and a DumbWriter, which are both pretty basic but
reasonably close to what "lynx -dump" does. If you want better formatting,
you can write your own formatter and/or writer.
Here's some sample code that does basically what you want. Note that I use a
StringIO, since DumbWriter wants to write to a file, but you want the value
in a string:
from urllib import urlopen
from StringIO import StringIO
import formatter
import htmllib

url = "http://www.example.com/"  # placeholder: the page you want to dump
# first, retrieve the HTML...
html = urlopen(url).read()
# create a "string file"...
outfile = StringIO()
# create a writer and formatter...
myWriter = formatter.DumbWriter(outfile)
myFormatter = formatter.AbstractFormatter(myWriter)
# now parse and format the HTML...
parser = htmllib.HTMLParser(myFormatter)
parser.feed(html)
parser.close()
# get the formatted output
data = outfile.getvalue()
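
The htmllib and formatter modules used above belong to Python 2 and are not
available in current Python 3 releases. As a rough stand-in, the same idea
(parse the HTML, write the text to a "string file", read the string back)
can be sketched with html.parser from the Python 3 standard library. This is
not the htmllib API: the class TextDumper, the function html_to_text, and
the two tag sets are hypothetical names chosen for this sketch, and the
output is much cruder than what "lynx -dump" produces.

```python
from html.parser import HTMLParser
from io import StringIO

# Assumption: these tags start a new line; extend the set as needed.
BLOCK_TAGS = {"title", "p", "br", "div", "li", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Content of these tags is not shown to the reader at all.
SKIPPED_TAGS = {"script", "style"}


class TextDumper(HTMLParser):
    """Write the visible text of an HTML document to a file-like object."""

    def __init__(self, outfile):
        super().__init__(convert_charrefs=True)
        self.outfile = outfile
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.outfile.write("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.outfile.write("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.outfile.write(data)


def html_to_text(html):
    """Return the text of `html`, one block per line (hypothetical helper)."""
    outfile = StringIO()  # the "string file", as in the sample above
    parser = TextDumper(outfile)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in outfile.getvalue().splitlines())
    return "\n".join(line for line in lines if line)
```

Passing the HTML string retrieved from the server to html_to_text returns
the dumped text as a single string. For better formatting (link lists,
indentation, tables) you would add more handlers to the parser subclass,
much as you would write your own formatter or writer in the Python 2
version.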
This answer is courtesy of QuestionExchange.com