[XML-SIG] minidom w/ HTML
cbearden at hal-pc.org
Thu Jun 24 10:49:07 EDT 2004
On Mon, Jun 21, 2004 at 12:25:59PM -0700, jennyw wrote:
> I have a project where I need to parse html files that are table heavy
> (a calendar, actually), and I thought minidom would be perfect for my
> needs. The problem is that the HTML that I'm trying to parse isn't quite
> valid XML -- mostly minor things, but enough so that minidom won't work.
> Is there something that would convert an html file into XML that
> would work with minidom? Or is there something better, like something
> more geared towards html that I should be looking at?
> The reason I thought of minidom is because I want to easily be able to
> navigate through table cells. Basically, it's a weekly calendar, and
> there's a table that has cells for each day. Inside each day cell, there
> are cells for time and for the name of the event. There are other ways
> to do this, but I'd like to learn more about parsing XML documents and
> thought this would be a good way to accomplish my immediate needs and learn
> something new.
I have used a combination of one of the Python tidy implementations
together with the microdom from the Twisted framework. When
creating a Twisted microdom, the 'parseString' method takes an optional
argument 'beExtremelyLenient', which does just what it says. Some HTML
has flaws so serious (e.g. unbalanced quotes in attribute values) that
these must be corrected before tidying. You can imagine a three-step
process:
(1) ad hoc fixing of HTML problems, if necessary;
(2) creating a "tidied" version of the HTML doc;
(3) creating an extremely lenient twisted.web.microdom object.
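
A rough sketch of those three steps in Python follows. It rests on some
assumptions: the tidy binding is pytidylib (imported as 'tidylib'), which is
only one of several Python tidy wrappers; the helper names
(fix_unbalanced_quotes, tidy, parse_lenient, html_to_dom) are made up for
illustration; and the regular expression handles only one kind of damage,
a double-quoted attribute value that is never closed before the end of the
tag. Real pages will likely need their own ad hoc fixes.

```python
import re


def fix_unbalanced_quotes(html):
    """Step 1 (ad hoc): close a double-quoted attribute value that runs
    straight into the end of the tag, e.g. <td class="day> ."""
    return re.sub(r'=\s*"([^"<>]*)>', r'="\1">', html)


def tidy(html):
    """Step 2: run the markup through tidy, if a binding is available.

    Assumes pytidylib; if it (or the underlying tidy library) is missing,
    the input is returned unchanged."""
    try:
        from tidylib import tidy_document
        cleaned, _errors = tidy_document(
            html, options={"output-xhtml": 1, "numeric-entities": 1}
        )
    except (ImportError, OSError):
        return html
    return cleaned


def parse_lenient(html):
    """Step 3: build a microdom tree with the lenient parser (needs Twisted)."""
    from twisted.web import microdom
    return microdom.parseString(html, beExtremelyLenient=True)


def html_to_dom(html):
    """Chain the three steps: ad hoc fix, tidy, lenient parse."""
    return parse_lenient(tidy(fix_unbalanced_quotes(html)))
```

With Twisted installed, html_to_dom(page_text) should hand back a microdom
document for the calendar page; the first helper can be tried on its own
with nothing beyond the standard library.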
Itamar Shtull-Trauring has an introductory article on the Twisted
microdom at O'Reilly's XML.com.
Hope this helps,