[XML-SIG] xml / html parsing for webbot

uche.ogbuji@fourthought.com
Sun, 10 Dec 2000 06:51:59 -0700


> > For that purpose, the DOM authors made special support for HTML. You
> > normally need a special parser, one that is capable of processing
> > HTML, and still building a DOM tree. PyXML now includes 4DOM, which, I
> > believe, is capable of converting arbitrary HTML into a DOM tree.
> 
> Logilab contributed a much improved version of FromHtml to 4DOM a while
> ago which was included in 4Suite 0.9.2 I think. I don't know which version
> is shipped in PyXML 0.6.2, though. If you need this piece of code, and
> can't find it in your distribution, just ask.

This was after PyXML 0.6.2, so it's not included.  We have a few improvements 
to make yet to 4DOM before we release 4Suite 0.10.1 in a few weeks.  Are there 
any plans on the horizon to release PyXML 0.6.3?  If so, we'll get all the 
changes in before then.

I should note that the code from Logilab meticulously sets up the HTML content 
model according to spec.  It's a brilliant piece of work.  However, in many 
cases of HTML usage you would be able to get by just fine with the DOM code in 
PyXML 0.6.2.  If you start to run into problems, you might want to install 
4Suite 0.10.0, which includes Logilab's code and many other fixes.
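
To make "HTML into a DOM tree" concrete, here is a minimal sketch.  It does 
not use 4DOM or the Logilab reader, whose API is not shown in this message.  
As a stand-in it uses only the current Python standard library (html.parser 
and xml.dom.minidom), and the names DomBuilder and html_to_dom are 
hypothetical.  It also ignores the HTML content model, so it only recovers 
from unclosed tags when an enclosing element is closed, which is roughly the 
"get by just fine" case rather than the spec-accurate one.

```python
from html.parser import HTMLParser
from xml.dom.minidom import getDOMImplementation

# Elements that never take a closing tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class DomBuilder(HTMLParser):
    """Hypothetical helper: turns HTML parser events into a minidom tree."""

    def __init__(self):
        super().__init__()
        # Always start from a single <html> root so fragments also work.
        self.document = getDOMImplementation().createDocument(None, "html", None)
        self.stack = [self.document.documentElement]

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            element = self.stack[0]  # reuse the pre-made root element
        else:
            element = self.document.createElement(tag)
            self.stack[-1].appendChild(element)
        for name, value in attrs:
            element.setAttribute(name, value or "")
        if tag != "html" and tag not in VOID_ELEMENTS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element.
        # Stray end tags (no matching open element) are ignored.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tagName == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        self.stack[-1].appendChild(self.document.createTextNode(data))


def html_to_dom(text):
    """Parse an HTML string and return an xml.dom.minidom Document."""
    builder = DomBuilder()
    builder.feed(text)
    builder.close()
    return builder.document
```

The result can then be walked with the usual DOM calls, for example 
html_to_dom(page).getElementsByTagName("a") to collect the links a webbot 
would follow.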


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python