[XML-SIG] minidom w/ HTML
Fred L. Drake, Jr.
fdrake at acm.org
Thu Jun 24 11:00:23 EDT 2004
On Monday 21 June 2004 03:25 pm, jennyw wrote:
> I have a project where I need to parse HTML files that are table-heavy
> (a calendar, actually), and I thought minidom would be perfect for my
> needs. The problem is that the HTML that I'm trying to parse isn't quite
> valid XML -- mostly minor things, but enough so that minidom won't work.
I wouldn't generally expect HTML to be parsable as XML, only XHTML.
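
To illustrate the difference, here is a minimal sketch. The two snippets are made up for illustration and are not taken from the calendar page in question. The only difference between them is whether the <br> element is closed, which is enough to decide whether minidom accepts the input.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical table fragments, invented for this example.
HTML_SNIPPET = "<table><tr><td>Mon<br>21</td></tr></table>"
XHTML_SNIPPET = "<table><tr><td>Mon<br />21</td></tr></table>"


def parses_as_xml(text):
    """Return True if minidom accepts the text as well-formed XML."""
    try:
        minidom.parseString(text)
    except ExpatError:
        # expat rejects the unclosed <br> as a mismatched tag.
        return False
    return True


print(parses_as_xml(HTML_SNIPPET))   # False
print(parses_as_xml(XHTML_SNIPPET))  # True
```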
> Is there something that would convert an HTML file into XML that
> would work with minidom? Or is there something better, like something
> more geared towards HTML that I should be looking at?
You could run the HTML through HTML Tidy to convert it to XHTML before
parsing it as XML. This can be done with the Tidy command-line tool; I
believe someone has also built a Python interface to Tidy.
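
The command-line route might look roughly like the sketch below. It assumes a `tidy` executable is installed and on the PATH, and that the `-asxhtml`, `-numeric` and `-quiet` options behave as described in Tidy's own documentation. The helper names `tidy_command` and `parse_html_with_tidy` are made up for this example.

```python
import shutil
import subprocess
from xml.dom import minidom


def tidy_command(path):
    """Build a Tidy command line that writes XHTML for `path` to stdout."""
    # -asxhtml: emit XHTML rather than HTML
    # -numeric: write numeric entities (e.g. &#160;) instead of named ones
    #           such as &nbsp;, which an XML parser would not know
    # -quiet:   suppress the summary Tidy normally prints
    return ["tidy", "-quiet", "-asxhtml", "-numeric", path]


def parse_html_with_tidy(path):
    """Run Tidy on an HTML file and parse its output with minidom."""
    if shutil.which("tidy") is None:
        raise RuntimeError("the tidy executable was not found on PATH")
    result = subprocess.run(tidy_command(path), capture_output=True)
    # Tidy exits with 0 on success, 1 if there were warnings, 2 on errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return minidom.parseString(result.stdout)
```

With a file named, say, calendar.html (a placeholder name), the table cells could then be reached with `parse_html_with_tidy("calendar.html").getElementsByTagName("td")`.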
Fred L. Drake, Jr. <fdrake at acm.org>
PythonLabs at Zope Corporation