[XML-SIG] xml / html parsing for webbot
uche.ogbuji@fourthought.com
Sun, 10 Dec 2000 06:32:03 -0700
> > Especially I found some pages which were generated by scripts, do
> > contain unmatched tags in the pages. How the two approaches handle
> > them?
>
> For that purpose, the DOM authors made special support for HTML. You
> normally need a special parser, one that is capable of processing
> HTML, and still building a DOM tree. PyXML now includes 4DOM, which, I
> believe, is capable of converting arbitrary HTML into a DOM tree.
Correct as usual, Martin, although Python's standard htmllib gets much of the
credit for wrangling unruly HTML.
Here's a little demo. It shows how to read in any HTML and print out shiny
XHTML. Basically, it has the functionality of the highly popular Tidy
(http://www.w3.org/People/Raggett/tidy/) or JTidy
(http://lempinen.net/sami/jtidy/), but with XHTML output. (It can easily be
modified to produce cleaned HTML output.)
[uogbuji@borgia one-offs]$ cat html-to-xhtml-converter.py
import sys
from xml.dom.ext.reader import HtmlLib
import xml.dom.ext
#set up a re-usable reader object
reader = HtmlLib.Reader()
#parse HTML from file or URI given on command line. Return the DOM document
doc = reader.fromUri(sys.argv[1])
#Just for kicks, write it out as XHTML, i.e. all lowercase, XML syntax for
#empty tags, all attributes with given value, etc.
xml.dom.ext.XHtmlPrettyPrint(doc)
[uogbuji@borgia one-offs]$ cat data/example-from-wsdl-xslt-article.html
<HTML>
<HEAD>
<TITLE>Service summary: EndorsementSearch</TITLE>
<META charset='UTF-8' HTTP-EQUIV='content-type' CONTENT='text/html'>
</HEAD>
<BODY STYLE='background: #ffffff'>
<H1>Service summary: EndorsementSearch</H1>
<HR>
<TABLE>
<THEAD>Service: EndorsementSearchService</THEAD>
<TBODY>
<TR>
<TD STYLE='background: #ccffff' COLSPAN='3'>
<I>snowboarding-info.com Endorsement Service</I>
</TD>
</TR>
<TR>
<TD>Port: </TD>
<TD STYLE='background: #ffccff'>http://www.snowboard-info.com/EndorsementSearch</TD>
<TD STYLE='background: #ff66ff'>SOAP</TD>
</TR>
</TBODY>
</TABLE>
</BODY>
</HTML>
[uogbuji@borgia one-offs]$ python html-to-xhtml-converter.py data/example-from-wsdl-xslt-article.html
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
<html xmlns = 'http://www.w3.org/1999/xhtml'>
<head>
<title/>Service summary: EndorsementSearch
<meta charset='UTF-8' http-equiv='content-type' content='text/html'/>
</head>
<body style='background: #ffffff'>
<h1>Service summary: EndorsementSearch</h1>
<hr/>
<table>
<thead/>Service: EndorsementSearchService
<tbody/>
<tr>
<td style='background: #ccffff' colspan='3'>
<i>snowboarding-info.com Endorsement Service</i>
</td>
</tr>
<tr>
<td>Port:</td>
<td style='background: #ffccff'>http://www.snowboard-info.com/EndorsementSearch</td>
<td style='background: #ff66ff'>SOAP</td>
</tr>
</table>
</body>
</html>
[uogbuji@borgia one-offs]$
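
For anyone who wants to see the serialization rules in isolation (all
lowercase, XML syntax for empty tags, every attribute given a value), here is
a rough sketch that uses only the standard library's html.parser instead of
4DOM. The names XhtmlWriter, VOID_TAGS and to_xhtml are made up for this
illustration and are not part of PyXML. It only rewrites syntax as the tokens
stream past: it builds no DOM tree, does not repair unmatched tags, and would
also escape the contents of script and style elements, so treat it as a toy
rather than a replacement for the script above.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have content; written in XML empty-tag form.
# (Illustrative subset, not a complete list.)
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input",
             "link", "meta", "param"}


class XhtmlWriter(HTMLParser):
    """Hypothetical helper: re-emit parsed HTML tokens in XHTML-style syntax."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already lowercased tag and attribute names.
        # An attribute written without a value arrives as None; give it
        # its own name as the value, e.g. disabled='disabled'.
        rendered = "".join(
            " %s='%s'" % (name, escape(name if value is None else value,
                                       quote=True))
            for name, value in attrs
        )
        closer = "/>" if tag in VOID_TAGS else ">"
        self.parts.append("<%s%s%s" % (tag, rendered, closer))

    def handle_endtag(self, tag):
        # Void elements were already closed by their "/>" start tag.
        if tag not in VOID_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.parts.append(escape(data, quote=False))


def to_xhtml(html_text):
    """Return html_text re-serialized with XHTML-style tag syntax."""
    writer = XhtmlWriter()
    writer.feed(html_text)
    writer.close()
    return "".join(writer.parts)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as source:
        sys.stdout.write(to_xhtml(source.read()))
```

Run it with a file name on the command line, the same way as the 4DOM script.
Unlike that script, it keeps the input's whitespace and adds no XML
declaration or DOCTYPE.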
--
Uche Ogbuji Principal Consultant
uche.ogbuji@fourthought.com +1 303 583 9900 x 101
Fourthought, Inc. http://Fourthought.com
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python