[XML-SIG] minidom w/ HTML

Mike Hostetler hostetlerm at gmail.com
Mon Jun 28 14:54:41 EDT 2004


On Mon, 21 Jun 2004 12:25:59 -0700, jennyw
<jennyw at colorfulexpressions.com> wrote:
> 
> I have a project where I need to parse html files that are table heavy
> (a calendar, actually), and I thought minidom would be perfect for my
> needs. The problem is that the HTML that I'm trying to parse isn't quite
> valid XML -- mostly minor things, but enough so that minidom won't work.
> Is there something that would convert an html file into XML that
> would work with minidom? Or is there something better, like something
> more geared towards html that I should be looking at?
> 

I've recently discovered BeautifulSoup, and it works wonderfully for
parsing HTML:

http://www.crummy.com/software/BeautifulSoup/
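
To give a feel for what a tolerant parser buys you on table-heavy
pages, here is a minimal sketch. It uses the standard library's
html.parser module as a stand-in, so it is not BeautifulSoup's API; the
names CellCollector and extract_cells are made up for illustration. The
point is only that unclosed cells, uppercase tags and unquoted
attributes do not stop the parse.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each table cell, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.cells = []        # one string per <td>/<th> seen so far
        self._in_cell = False  # True while we are inside a cell

    def handle_starttag(self, tag, attrs):
        # A new cell implicitly ends the previous one, so a missing
        # </td> is harmless here.
        if tag in ("td", "th"):
            self.cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def extract_cells(html):
    """Return the stripped text of every table cell in html."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# None of the cells below are closed, and one attribute is unquoted.
print(extract_cells("<TABLE><TR><TD class=day>Mon<TD>Tue</TABLE>"))
```

For a calendar you would probably also want to track rows, but the
same event-driven idea applies.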

I've also used the "run it through Tidy, then parse with minidom"
approach before. It works fine, but it can be quite slow, especially
when the HTML looks nothing like XHTML.
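
A rough sketch of that pipeline, under some assumptions: the tidy
command-line program is installed and on the PATH, and the helper names
(run_tidy, cell_texts, is_well_formed) are hypothetical. The -numeric
flag is used so that entities such as &nbsp; come out as numeric
references, which the XML parser behind minidom can handle without a
DTD.

```python
import subprocess
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def run_tidy(html):
    """Pipe HTML through the external tidy program; return XHTML bytes.

    Assumes a `tidy` executable is available on the PATH.
    """
    result = subprocess.run(
        ["tidy", "-asxhtml", "-numeric", "-quiet", "-utf8",
         "--show-warnings", "no"],
        input=html.encode("utf-8"),
        capture_output=True,
    )
    # tidy exits with 0 (clean), 1 (warnings only) or 2 (errors).
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout


def is_well_formed(markup):
    """Return True if minidom can parse the markup as XML."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        return False
    return True


def cell_texts(xhtml):
    """Return the direct text content of every <td> in well-formed XHTML."""
    dom = minidom.parseString(xhtml)
    cells = []
    for td in dom.getElementsByTagName("td"):
        text = "".join(
            node.data for node in td.childNodes
            if node.nodeType == node.TEXT_NODE
        )
        cells.append(text.strip())
    return cells


# Typical use, left commented out because it needs tidy installed:
# cells = cell_texts(run_tidy(open("calendar.html").read()))
```

Spawning an external process and re-parsing its output is one place
the time can go, particularly over many files.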

-- mikeh


