[XML-SIG] xml / html parsing for webbot

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sun, 10 Dec 2000 11:48:39 +0100


> 1. xml.dom.walker and xml.dom.writer are missing from Python 2.0's xml
> package. What are they used for?

Indeed. These classes originate from PyDOM, which is obsolete. In
Python 2.0, only minidom is included. There is no equivalent of a
walker class in minidom. Instead of a writer, you can probably call
the .toxml() method of a node in most cases.
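
As a rough sketch of what that looks like with minidom: the walk()
helper below is a hypothetical stand-in for a walker class, not part
of minidom, and the sample document is made up. It is written for a
current Python 3 rather than Python 2.0.

```python
from xml.dom import minidom


def walk(node, depth=0):
    """Yield (depth, node) pairs in document order (hypothetical helper)."""
    yield depth, node
    for child in node.childNodes:
        yield from walk(child, depth + 1)


# Made-up sample document, for illustration only.
doc = minidom.parseString(
    '<page><a href="a.html">A</a><a href="b.html">B</a></page>'
)

# Visit every node and keep the names of the element nodes.
element_names = [
    node.tagName for _, node in walk(doc) if node.nodeType == node.ELEMENT_NODE
]

# Serialize the tree back to a string, in place of a writer class.
serialized = doc.toxml()
print(element_names)
print(serialized)
```

A plain recursive generator is usually enough for simple traversals;
toxml() can also be called on any single node to serialize a subtree.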

> I have thought about not building a DOM tree but using regular
> expressions to extract all links. Can somebody tell me from their
> experience how the two approaches compare? Which is better?

In principle, an approach using regular expressions can fail more
easily than a solution that actually analyses the structure of the
document. For most practical purposes, the solution using regular
expressions will work just fine. In the end, all that matters is that
it works.
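
To make the trade-off concrete, here is a small sketch comparing the
two on a made-up page. The pattern HREF_RE and the class LinkCollector
are illustrative names of my own, and the structural side uses the
html.parser module from a current Python 3 rather than a DOM; other
patterns and parsers will behave differently.

```python
import re
from html.parser import HTMLParser

# A deliberately simple pattern: <a ... href="..."> with a quoted value.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up page: one normal link, one commented-out link, one unquoted href.
page = (
    '<a href="one.html">1</a> '
    '<!-- <a href="old.html">x</a> --> '
    '<A HREF=two.html>2</A>'
)

regex_links = HREF_RE.findall(page)

collector = LinkCollector()
collector.feed(page)
collector.close()
parser_links = collector.links

print(regex_links)
print(parser_links)
```

With this particular pattern, the regular expression picks up the
commented-out link and misses the unquoted one, while the parser does
the opposite. A more elaborate pattern can close some of these gaps,
which is why the regular expression is often good enough in practice.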

> In particular, I found some pages, generated by scripts, that
> contain unmatched tags. How do the two approaches handle
> them?

For that purpose, the DOM authors added special support for HTML. You
normally need a special parser, one that is capable of processing
HTML while still building a DOM tree. PyXML now includes 4DOM, which, I
believe, is capable of converting arbitrary HTML into a DOM tree.
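
The difference between a strict XML parser and an HTML-aware one can
be sketched as follows. This does not use 4DOM; as an assumption for
illustration, it substitutes the tolerant html.parser module from a
current Python 3, which reports tags as events instead of building a
DOM tree. The sample markup and the HrefCollector class are made up.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Made-up markup with unclosed <p>, <a> and <html> elements.
broken = (
    '<html><body><p>See <a href="x.html">x'
    '<p>and <a href="y.html">y</a></body>'
)

# A strict XML parser rejects the document outright.
try:
    minidom.parseString(broken)
    strict_ok = True
except ExpatError:
    strict_ok = False


class HrefCollector(HTMLParser):
    """Record href values from <a> tags, ignoring tag balance."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


# The HTML-aware parser does not require tags to be matched.
collector = HrefCollector()
collector.feed(broken)
collector.close()
links = collector.links

print(strict_ok)
print(links)
```

So with unmatched tags, minidom alone is not an option; some parser
that tolerates real-world HTML has to sit in front of whatever tree or
link list gets built.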

Regards,
Martin