HTML -> XML: Where's HtmlBuilder()?

F. GEIGER fgeiger at datec.at
Sun Mar 10 11:51:03 EST 2002


It's Sunday evening and I can no longer see the forest for the trees. Guess
I need a break and, better yet, some hints...

Anyway:
I'd like to process a bunch of HTML pages using XSLT. To do so they need to
be "more XML like" (<br /> instead of <br> etc.).
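
To make "more XML like" concrete, here is a rough sketch of the kind of
conversion I mean. It is only an illustration: it assumes a current Python 3
standard library (html.parser and xml.sax.saxutils), not the HtmlBuilder API
from PyXML, and the names VOID_TAGS, XmlishWriter and html_to_xmlish are made
up for this example. It self-closes void tags, quotes attribute values and
escapes text. It does not repair unbalanced tags, and it drops comments and
the doctype.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML elements that never have content or a closing tag (assumed subset).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class XmlishWriter(HTMLParser):
    """Re-emit HTML with void tags self-closed and attribute values quoted."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _attrs(self, attrs):
        # Valueless attributes (e.g. <input checked>) get their name as value.
        return "".join(
            " %s=%s" % (name, quoteattr(name if value is None else value))
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.parts.append("<%s%s />" % (tag, self._attrs(attrs)))
        else:
            self.parts.append("<%s%s>" % (tag, self._attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        # Input that is already written as <tag ... /> stays self-closed.
        self.parts.append("<%s%s />" % (tag, self._attrs(attrs)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:  # drop stray </br> and the like
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape for XML.
        self.parts.append(escape(data))


def html_to_xmlish(html):
    """Return `html` rewritten so that simple, balanced markup parses as XML."""
    writer = XmlishWriter()
    writer.feed(html)
    writer.close()
    return "".join(writer.parts)
```

For example, html_to_xmlish('<p>a<br>b</p>') gives '<p>a<br />b</p>', which
an XML parser (and hence an XSLT processor) will accept.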

I remembered an article by Dr. D. Mertz (Charming Python #2, July 2000),
in which HtmlBuilder() is used for this.

But exec'ing 'from xml.dom.html_builder import HtmlBuilder' yields:

>>> from xml.dom.html_builder import HtmlBuilder
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ImportError: No module named html_builder
>>>

I've installed PyXML 0.6.6 and 4Suite 0.11, so I thought I was equipped for
such tasks. A search of my Python installation directory turned up this:

C:\Programme\Python21\xmldoc\demo\dom\html2html

Its contents are similar to the snippets in Mertz's article, i.e.
HtmlBuilder is used as well:

<snipped/>

# Construct an HtmlBuilder object and feed the data to it
b = HtmlBuilder()
b.feed(HTML_DATA)

# Get the newly-constructed document object
doc = b.document

# Output it as HTML
print "============"
print "HTML version"
w = HtmlWriter()
w.write(b.document)

# Output it as XML
print "\n==========="
print "XML version"
print doc.toxml()

<snipped/>


So several questions come to my mind:
Is HtmlBuilder() deprecated?
Which module was it replaced with?
What else could I use to convert HTML to XML?
Do I need additional modules for this?
Or am I already prepared w/o knowing it?
Or am I completely wrong? If so, how is this done in a
"state-of-the-art" manner?

Many thanks in advance and best regards
Franz GEIGER


P.S.:

For those who are curious - the big picture:

I'm writing several modules to process a bunch of HTML files into a site.
A NavBuilder takes a site structure definition (XML format), builds the
navigation tree for each page and merges it into it.
A Linker resolves specially formatted refs to other pages into real HTML
refs.
An Execer executes Python code snippets it finds in the pages (the way PMZ
etc. do).
A Shaper applies an XSL file to the HTML pages to give them all the same
formatting (this is where I am stuck now). (Yes, I could achieve this with
CSS on the client side, but I don't want to.) Then it includes them in a
template file containing the stuff that is the same for all pages (logos
etc.). (Yes, I could achieve this with frames, but I don't want to.)
Then all pages are ftp'ed to a web server.
All these modules are tied together by a wxPython GUI. So one can load a
site, edit some XML and XSL files, and press a button to process and
transfer it.
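
The way the modules hang together can be sketched roughly like this. It is a
toy illustration only, written for a current Python 3: the "[[name]]" ref
syntax, the "{content}" placeholder and the function names linker, shaper and
process are all hypothetical stand-ins, not my real modules. Each stage takes
a page as a string and returns the processed page.

```python
import re
from functools import reduce


def linker(page):
    # Hypothetical ref syntax: "[[name]]" becomes a link to name.html.
    return re.sub(r"\[\[(\w+)\]\]", r'<a href="\1.html">\1</a>', page)


def shaper(page, template="<html><body>{content}</body></html>"):
    # Stand-in for the template step only; the real Shaper would run XSLT first.
    return template.replace("{content}", page)


def process(page, stages):
    """Run `page` through each stage in order and return the result."""
    return reduce(lambda text, stage: stage(text), stages, page)
```

With these stand-ins, process("See [[about]].", [linker, shaper]) wraps the
page in the template after the ref has been turned into a link.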






