[XML-SIG] minidom w/ HTML

Fred L. Drake, Jr. fdrake at acm.org
Thu Jun 24 11:00:23 EDT 2004


On Monday 21 June 2004 03:25 pm, jennyw wrote:
 > I have a project where I need to parse html files that are table heavy
 > (a calendar, actually), and I thought minidom would be perfect for my
 > needs. The problem is that the HTML that I'm trying to parse isn't quite
 > valid XML -- mostly minor things, but enough so that minidom won't work.

I wouldn't generally expect HTML to be parsable as XML, only XHTML.

 >   Is there something that would convert an html file into XML that
 > would work with minidom? Or is there something better, like something
 > more geared towards html that I should be looking at?

You could run the HTML through HTML Tidy to convert it to XHTML before
parsing it as XML.  This could be done using the HTML Tidy command-line
tool, or I think someone has built a Python interface to Tidy.
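
A minimal sketch of the command-line route, assuming a `tidy` executable is
installed and on the PATH.  The helper names (tidy_to_xhtml, cell_texts) are
made up for illustration, and the option set (-q, -asxml, -numeric, -utf8) is
one reasonable choice rather than the only one; -numeric is there so that
named entities such as &nbsp; come out as numeric references that the XML
parser accepts without reading the DTD.

```python
import subprocess
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def tidy_to_xhtml(html_text, tidy_cmd="tidy"):
    """Pipe HTML through the tidy command line and return XHTML text.

    tidy_cmd is an assumption: the name or path of the HTML Tidy executable.
    """
    result = subprocess.run(
        [tidy_cmd, "-q", "-asxml", "-numeric", "-utf8"],
        input=html_text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 (clean), 1 (warnings only) or 2 (errors).
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")


def cell_texts(xhtml_text):
    """Parse well-formed markup with minidom and return the text of each <td>."""
    doc = minidom.parseString(xhtml_text)
    texts = []
    for td in doc.getElementsByTagName("td"):
        # Only direct text children are collected; nested markup is ignored.
        parts = [n.data for n in td.childNodes if n.nodeType == n.TEXT_NODE]
        texts.append("".join(parts).strip())
    return texts
```

Typical use would be cell_texts(tidy_to_xhtml(raw_html)).  Feeding the raw
HTML straight to minidom.parseString() raises ExpatError as soon as it hits
an unclosed tag, which is the failure described in the original question.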


  -Fred

-- 
Fred L. Drake, Jr.  <fdrake at acm.org>
PythonLabs at Zope Corporation



