parsing complex web pages
jdhunter at ace.bsd.uchicago.edu
Wed Jun 18 23:16:03 CEST 2003
I sometimes parse very complex web pages, like
http://biz.yahoo.com/z/a/x/xom.html, to get all the quantitative
information out of them. I find that parsing the text is easier than
parsing the HTML, because there is so much extra syntax in the HTML
that it requires writing complex regexes or parsers, whereas with the
text version simple string splits and finds usually suffice.
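For instance, once a table has been flattened to text, a split per line
recovers the fields. This is only a sketch: the "TICKER VALUE" row shape
and the sample rows are made up for illustration, not taken from the
actual Yahoo page.

```python
def parse_rows(text):
    # Each dumped table row flattens to whitespace-separated fields;
    # here we assume a hypothetical "TICKER VALUE" layout per line.
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            rows.append((fields[0], float(fields[1])))
    return rows

# stand-in for a fragment of lynx -dump output
sample = "XOM 2.31\nCVX 1.98"
```

Header lines and anything else that doesn't match the two-field shape
simply fall through the `len(fields) == 2` test and are skipped.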
The current strategy I use is to call lynx via popen to convert the
page to text. If you do:
lynx -dump http://biz.yahoo.com/z/a/x/xom.html
you'll see that the text comes back for the most part in a nicely
formatted, readily parsable layout; eg, the layout of many of the
tables is preserved.
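The popen strategy might be sketched like this; `dump_page` and
`find_value` are illustrative names, the "label value on one line"
layout is an assumption about the dump, and lynx must be on the PATH
for the commented-out call to work:

```python
import os

def dump_page(url):
    # Shell out to lynx to render the page as plain text
    # (requires lynx to be installed).
    return os.popen('lynx -dump %s' % url).read()

def find_value(text, label):
    # Return the last token of the first line containing `label`,
    # assuming the dump keeps a label and its value on one line.
    for line in text.splitlines():
        if label in line:
            return line.split()[-1]
    return None

# e.g.: find_value(dump_page('http://biz.yahoo.com/z/a/x/xom.html'),
#                  'P/E Ratio')
```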
If you do one of the standard html2txt conversions using an HTMLParser
and a DumbWriter, the text layout is not as well preserved (albeit in
pure Python):
from urllib import urlopen
import htmllib, formatter

class Catcher:
    def __init__(self):
        self.lines = []
    def write(self, line):
        self.lines.append(line)
    def __getitem__(self, index):
        return self.lines[index]

def html2txt( fh ):
    oh = Catcher()
    p = htmllib.HTMLParser(
        formatter.AbstractFormatter(formatter.DumbWriter(oh)))
    p.feed(fh.read())
    return ' '.join(oh.lines)

# eg: print html2txt(urlopen('http://biz.yahoo.com/z/a/x/xom.html'))
I am more or less happy with the lynx solution, but would be happier
with a pure python solution.
Is there a good HTML -> text solution in Python that preserves table
layouts and other visual formatting?
What do people recommend for quick parsing of complicated web pages?