Parsing HTML

Jeff Rush jrush at summit-research.com
Fri Jun 18 03:04:55 EDT 1999


On Thu, 17 Jun 1999 13:27:33, Mordy Ovits <movits at lockstar.com> wrote:

> What is the best way to parse HTML into a Python data structure, allow me to change it,
> and output it as HTML?
> -- 
> o Mordy Ovits
> o Cryptographic Engineer
> o LockStar Inc.

One good way is to use the DOM (Document Object Model) support in the
Python XML package.  You can get the package and docs at www.python.org
in the XML-SIG section.  A tiny example follows:

----- cut here -----

#!/usr/bin/env python

from xml.dom.html_builder import HtmlBuilder
from xml.dom.writer       import HtmlWriter

# Read in Original HTML Source
# and Build a DOM Tree Structure
# ----------------------------

htmlstr = open('test.html', 'r').read()
b = HtmlBuilder(ignore_mismatched_end_tags=1)
b.feed(htmlstr)   # Stuff the HTML Source into the HTML Parser
doc = b.document  # Get the Newly Constructed Document Object

# Perform Modifications as Needed
# -------------------------------

text = doc.createTextNode("Additional Title Text")
titlenode = doc.getElementsByTagName('TITLE')[0]
titlenode.appendChild(text)

# Write DOM Tree Back as HTML Source
# --------------------------------

fd = open('output.html', 'w')
w = HtmlWriter(stream=fd)
w.write(doc)
fd.close()        # Flush and Close So the Output is Fully Written

----- cut here -----
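
For anyone without the XML-SIG package installed, here is a minimal sketch
of the same three steps (parse, modify, write back) using xml.dom.minidom
from the standard library.  This is a swapped-in module, not the
HtmlBuilder/HtmlWriter pair above, and it assumes the input is well-formed
XHTML: minidom is a strict XML parser, so unclosed tags will raise an error
and tag names are case-sensitive.  The SOURCE string and the
append_to_title helper are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical well-formed input (assumption: every tag is closed and
# tag names are lowercase, since minidom parses XML, not tag-soup HTML).
SOURCE = (
    "<html><head><title>Original Title</title></head>"
    "<body><p>Hello</p></body></html>"
)


def append_to_title(markup, extra):
    """Parse markup into a DOM tree, append text to <title>, serialize it."""
    # Build the DOM tree from the source string.
    doc = minidom.parseString(markup)

    # Same DOM calls as in the example above: create a text node and
    # append it to the first title element.
    text = doc.createTextNode(extra)
    titlenode = doc.getElementsByTagName("title")[0]
    titlenode.appendChild(text)

    # Serialize the root element (skips the XML declaration).
    return doc.documentElement.toxml()


result = append_to_title(SOURCE, " - Additional Title Text")
print(result)
```

The DOM calls (createTextNode, getElementsByTagName, appendChild) are the
same ones used above; only the parsing and writing ends differ.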

There is also an XML/DOM-based HTML pretty printer that was posted
to this list a week ago, which I find useful for cleaning up the
output HTML.  If you can't find it, drop me a line.
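
To give a rough idea of what DOM-based pretty printing looks like, here is
a small sketch using toprettyxml() from the standard library's
xml.dom.minidom.  It is not the pretty printer mentioned above, just an
illustration of the idea, and it again assumes well-formed XHTML input.
The MESSY string and the pretty helper are hypothetical names.  Note that
the added whitespace can change rendering inside whitespace-sensitive
elements such as pre.

```python
from xml.dom import minidom

# Hypothetical input: valid markup, but all on one line.
MESSY = (
    "<html><head><title>Test</title></head>"
    "<body><p>One</p><p>Two</p></body></html>"
)


def pretty(markup, indent="  "):
    """Return markup re-serialized with one element per line, indented."""
    doc = minidom.parseString(markup)
    # toprettyxml() walks the tree and inserts newlines and indentation.
    return doc.documentElement.toprettyxml(indent=indent)


pretty_html = pretty(MESSY)
print(pretty_html)
```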

-Jeff Rush




