parsing complex web pages

Wed Jun 18 17:16:03 EDT 2003

I sometimes parse very complex web pages like

  http://biz.yahoo.com/z/a/x/xom.html

to get all the quantitative information out of them.  I find that
parsing the text is easier than parsing the html, because there is so
much extra syntax in the html that requires writing complex regexes or
parsers, whereas with the text version simple string splits and finds
often suffice.
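
As a rough illustration of the split-based approach, here is a minimal
sketch in present-day Python 3.  The sample text and its numbers are
invented for the example and are not taken from the page above, and
parse_rows is a hypothetical helper name.

```python
# Invented sample of a text-dump table; labels and values are made up.
SAMPLE = """\
Earnings Estimates
                   This Qtr   Next Qtr
Avg Estimate       1.00       1.10
Low Estimate       0.90       1.00
"""

def parse_rows(text):
    """Map each row label to the list of numbers at the end of its line."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        values = []
        # Peel numeric fields off the right-hand end of the line.
        while parts:
            try:
                values.append(float(parts[-1]))
            except ValueError:
                break
            parts.pop()
        # Keep only lines that have both a label and at least one number.
        if parts and values:
            rows[" ".join(parts)] = values[::-1]
    return rows
```

Header lines with no numeric fields are skipped, so only the data rows
end up in the result.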

The current strategy I use is to call lynx via popen to convert the
page to text.  If you do:

  lynx -dump http://biz.yahoo.com/z/a/x/xom.html

you'll see that the text comes back for the most part in a nicely
formatted, readily parsable layout; e.g., the layout of many of the
tables is preserved.
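
The call to lynx can be sketched roughly as follows.  This assumes
Python 3's subprocess module in place of os.popen and a lynx
executable on the PATH; build_lynx_command and lynx_dump are
hypothetical helper names, and only the -dump option shown above is
used.

```python
import subprocess

def build_lynx_command(url, lynx="lynx"):
    """Return the argument list for 'lynx -dump <url>'."""
    return [lynx, "-dump", url]

def lynx_dump(url, lynx="lynx", timeout=60):
    """Run lynx and return the page rendered as text.

    Raises FileNotFoundError if lynx is not installed and
    subprocess.CalledProcessError if lynx exits with an error.
    """
    result = subprocess.run(
        build_lynx_command(url, lynx),
        capture_output=True,   # collect stdout and stderr
        text=True,             # decode output to str
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids going through the shell, so a
URL containing characters like & or ? needs no quoting.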

If you do one of the standard html2txt conversions using HTMLParser
and a DumbWriter, the text layout is not as well preserved (albeit
still parsable):

    from urllib import urlopen
    import htmllib, formatter

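    class Catcher:
        """File-like sink that collects everything DumbWriter writes."""
        def __init__(self):
            self.lines = []
        def write(self, line):
            self.lines.append(line)
        def __getitem__(self, index):
            return self.lines[index]
        def read(self):
            # DumbWriter writes its own spaces and newlines, so join
            # the chunks without adding a separator
            return ''.join(self.lines)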

    def html2txt(fh):
        oh = Catcher()
        p = htmllib.HTMLParser(
            formatter.AbstractFormatter(formatter.DumbWriter(oh)))
        p.feed(fh.read())
        p.close()  # flush any text still buffered in the parser
        return oh.read()

    print html2txt(urlopen('http://biz.yahoo.com/z/a/x/xom.html'))
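
The htmllib and formatter modules are not available in Python 3.  A
rough equivalent there, assuming the standard html.parser module, might
look like the sketch below.  TextExtractor and html_to_text are
hypothetical names, the choice of which tags start a new line is my own
assumption, and cells are separated by tabs rather than aligned in
columns, so this does not reproduce the lynx layout.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text, one line per block or table row, tab-separated cells."""
    BREAK_TAGS = {"p", "br", "div", "tr", "li", "table", "h1", "h2", "h3"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BREAK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in ("td", "th"):
            # End of a table cell: separate it from the next one.
            self.chunks.append("\t")

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Collapse runs of whitespace; whitespace-only data is dropped.
        text = re.sub(r"\s+", " ", data)
        if text.strip():
            self.chunks.append(text)

def html_to_text(html):
    """Convert an HTML string to plain text, dropping blank lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.rstrip() for line in "".join(parser.chunks).splitlines()]
    return "\n".join(line for line in lines if line.strip())
```

Each table row comes back as one line that can be split on tabs, which
keeps the row structure even though the visual column alignment is
lost.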

I am more or less happy with the lynx solution, but would be happier
with a pure python solution.

Is there a good pure python html -> text solution that preserves table
layouts and other visual formatting?

What do people recommend for quick parsing of complicated web pages?

John Hunter