Best DOM-like HTML parser?

Tue May 20 13:56:21 EDT 2003

"MAK" <mike at mmrd.com> wrote:

> I have a need to read and parse an HTML page (containing a table,
> which is the data I am after).  If this were XML, I'd be using the
> effbot's ElementTree, but this is plain-ol' HTML.  Is there anything
> out there as simple and easy to use as ElementTree, but for HTML?  Or
> am I going the wrong way?

note that the elementtree package comes with two HTML parsers,
both of which read HTML into an XHTML-ish element tree.

the HTMLTreeBuilder is relatively picky, and uses a simple-minded
approach to automatically close certain elements, and add missing
end tags where necessary.  usage:

    from elementtree import ElementTree, HTMLTreeBuilder

    # file is either a filename or an open stream
    tree = ElementTree.parse(file, parser=HTMLTreeBuilder.TreeBuilder())
    root = tree.getroot()

or

    from elementtree import HTMLTreeBuilder

    parser = HTMLTreeBuilder.TreeBuilder()
    parser.feed(data)
    root = parser.close()
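
to give a feel for what "automatically close certain elements, and
add missing end tags" can mean, here is a rough sketch built on the
standard library only.  it is not the HTMLTreeBuilder source; the
class name SimpleHTMLBuilder and the VOID set are made up for this
illustration, and the recovery rules are deliberately minimal.

```python
from html.parser import HTMLParser
from xml.etree.ElementTree import TreeBuilder

# hypothetical, partial list of elements that never take an end tag
VOID = {"br", "hr", "img", "input", "meta", "link"}


class SimpleHTMLBuilder(HTMLParser):
    """Illustrative only: builds an element tree from sloppy HTML."""

    def __init__(self):
        super().__init__()
        self._builder = TreeBuilder()
        self._open = []  # tags that are currently open, outermost first

    def handle_starttag(self, tag, attrs):
        # valueless attributes arrive as None; store them as ""
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            self._builder.end(tag)  # close void elements right away
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray end tag, ignore it
        # close every element down to and including the matching one
        while self._open:
            current = self._open.pop()
            self._builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self._open:  # drop text outside the root element
            self._builder.data(data)

    def close(self):
        super().close()
        while self._open:  # add end tags that never showed up
            self._builder.end(self._open.pop())
        return self._builder.close()


parser = SimpleHTMLBuilder()
parser.feed("<html><body><p>one<br>two<table><tr><td>x</td></tr></table>")
parser.feed("</body></html>")
root = parser.close()
```

the interface mirrors the feed/close usage above: the missing </p>
is supplied when </body> arrives, and <br> is closed as soon as it
is seen.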

the TidyTools module is a lot more competent, but relies on an
external utility:

    http://www.w3.org/People/Raggett/tidy/
    http://tidy.sourceforge.net/

usage:

    from elementtree import TidyTools

    tree = TidyTools.tidy(filename)
    root = tree.getroot()

the "tidy" function uses the external utility to turn your HTML file
into an XHTML document, and loads the result into an element tree.
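
the same idea can be sketched with the standard library.  this is
not the TidyTools code: the function name tidy_to_tree, the command
parameter, and the choice of flags (-q quiet, -n numeric entities,
-asxml XHTML output) are assumptions about a reasonable way to call
the tidy executable, which must be installed and on the path.

```python
import subprocess
import xml.etree.ElementTree as ET


def tidy_to_tree(filename, command=("tidy", "-q", "-n", "-asxml")):
    """Run an external tidy command on filename and parse its output.

    Hypothetical helper: `command` is the program plus flags; the
    file name is appended as the last argument.
    """
    result = subprocess.run(
        [*command, filename],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 for success and 1 for warnings; 2 means errors
    if result.returncode > 1:
        message = result.stderr.decode("utf-8", "replace")
        raise RuntimeError("tidy failed: " + message)
    # the output is XHTML, so an ordinary XML parser can load it
    return ET.ElementTree(ET.fromstring(result.stdout))
```

calling tidy_to_tree("page.html").getroot() would then return the
root element, with "page.html" standing in for your own file.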

note that the elements will all live in the XHTML namespace; to
get rid of that (so you can access the table element as "table"
instead of "{http://www.w3.org/1999/xhtml}table"), you can use
something like:

    NS = "{http://www.w3.org/1999/xhtml}"

    for node in tree.getiterator():
        if node.tag.startswith(NS):
            node.tag = node.tag[len(NS):]
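
putting the pieces together for the original question (pulling the
data out of a table), a sketch might look like the following.  it
uses the standard library's xml.etree.ElementTree, where iter()
plays the role of getiterator(); the helper names strip_namespace
and table_rows and the SAMPLE document are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/1999/xhtml}"

# stand-in for tidy output; a real page would come from a file
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body><table>
<tr><th>name</th><th>qty</th></tr>
<tr><td>spam</td><td>2</td></tr>
</table></body></html>"""


def strip_namespace(root, ns=NS):
    """Remove the namespace prefix from every tag, in place."""
    for node in root.iter():
        # comments and processing instructions have non-string tags
        if isinstance(node.tag, str) and node.tag.startswith(ns):
            node.tag = node.tag[len(ns):]
    return root


def table_rows(root):
    """Return the text of each th/td cell, one list per tr element."""
    rows = []
    for tr in root.iter("tr"):
        cells = [c for c in tr if c.tag in ("td", "th")]
        rows.append(["".join(c.itertext()).strip() for c in cells])
    return rows


root = strip_namespace(ET.fromstring(SAMPLE))
rows = table_rows(root)
print(rows)
```

itertext() is used rather than .text so that cells containing
inline markup (links, bold text) still yield their full text.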

</F>