[XML-SIG] xml / html parsing for webbot
uche.ogbuji@fourthought.com
Sun, 10 Dec 2000 06:51:59 -0700
> > For that purpose, the DOM authors made special support for HTML. You
> > normally need a special parser, one that is capable of processing
> > HTML, and still building a DOM tree. PyXML now includes 4DOM, which, I
> > believe, is capable of converting arbitrary HTML into a DOM tree.
>
> Logilab contributed a much improved version of FromHtml to 4DOM a while
> ago, which was included in 4Suite 0.9.2, I think. I don't know which version
> is shipped in PyXML 0.6.2, though. If you need this piece of code and
> can't find it in your distribution, just ask.
That contribution came after PyXML 0.6.2, so it's not included. We still have
a few improvements to make to 4DOM before we release 4Suite 0.10.1 in a few
weeks. Are there any plans on the horizon to release PyXML 0.6.3? If so, we'll
get all the changes in before then.
I should note that the code from Logilab meticulously sets up the HTML content
model according to the spec. It's a brilliant piece of work. However, for many
uses of HTML you would be able to get by just fine with the DOM code in
PyXML 0.6.2. If you start to run into problems, you might want to install
4Suite 0.10.0, which includes Logilab's code and many other fixes.
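To give a rough idea of what "getting by" without the content model looks
like, here is a minimal sketch. It is not 4DOM's or Logilab's code: it swaps
in the Python standard library's html.parser and xml.dom.minidom, and the
names DomBuilder, html_to_dom and VOID are made up for the illustration. It
builds a DOM tree from loosely written HTML but knows nothing about which
elements may contain which.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# Elements that HTML defines as empty, so they never get an end tag.
# (Hypothetical helper set for this sketch; not taken from 4DOM.)
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DomBuilder(HTMLParser):
    """Build a minidom tree from tag-soup HTML, ignoring the content model."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = Document()
        self.stack = [self.doc]  # currently open nodes, innermost last

    def handle_starttag(self, tag, attrs):
        # A DOM document may hold only one root element; skip extras.
        if self.stack[-1] is self.doc and self.doc.documentElement is not None:
            return
        node = self.doc.createElement(tag)
        for name, value in attrs:
            # Boolean attributes such as "disabled" arrive with value None.
            node.setAttribute(name, value if value is not None else name)
        self.stack[-1].appendChild(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # stray end tags with no match are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text cannot be a direct child of the Document node.
        if self.stack[-1] is not self.doc:
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def html_to_dom(text):
    builder = DomBuilder()
    builder.feed(text)
    builder.close()
    return builder.doc


doc = html_to_dom("<html><body><p>one<br>two<p>three</body></html>")
paragraphs = doc.getElementsByTagName("p")
```

With that input the second P ends up nested inside the first, because nothing
tells the builder that a new P implicitly closes an open one. That is the sort
of problem a reader that follows the HTML content model avoids.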
--
Uche Ogbuji Principal Consultant
uche.ogbuji@fourthought.com +1 303 583 9900 x 101
Fourthought, Inc. http://Fourthought.com
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python