[XML-SIG] minidom w/ HTML
Andrew Shearer
andrew at shearersoftware.com
Fri Jun 25 00:35:49 EDT 2004
You could use Python's HTMLParser module[1] or my own HTMLFilter
module[2]. Both present a SAX-like interface that calls back to your
code as tags fly by, rather than the DOM approach of handing you a
fully-formed, consistent data structure made from the document.
The DOM approach is harder to apply because typical HTML is not
well-formed, so the SAX-like interface is a more natural fit.
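For a table-heavy page like the calendar in the question, the callback
style might look something like the sketch below. It uses the Python 3
import path (html.parser; the module was named HTMLParser in Python 2).
TableCellCollector and table_rows are hypothetical names made up for
illustration, not part of either module. The sketch assumes tables are
not nested and that a new <td> or <tr> implicitly closes the previous one.

```python
from html.parser import HTMLParser  # Python 2 spelled this module "HTMLParser"


class TableCellCollector(HTMLParser):
    """Hypothetical helper: gather table cell text, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently open, if any
        self._cell = None   # text fragments of the cell currently open, if any

    def _close_cell(self):
        # Store the open cell's text (if a cell is open) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Finish any open cell, then store the open row (if a row is open).
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()    # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()   # tolerate a missing </td>
            if self._row is None:
                self._row = []   # tolerate a cell with no enclosing <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Text outside any cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)


def table_rows(html):
    """Return the cell text of every table row found in an HTML string."""
    parser = TableCellCollector()
    parser.feed(html)
    parser.close()
    parser._close_row()  # flush a row left open at end of input
    return parser.rows
```

Calling table_rows(page_source) returns a list of rows, each a list of
cell strings. Unclosed <td> and <tr> tags, the kind of thing that trips
up an XML parser, are simply absorbed by the handlers.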
[1] http://docs.python.org/lib/module-HTMLParser.html
[2] http://www.shearersoftware.com/software/developers/htmlfilter/
> From: jennyw <jennyw at colorfulexpressions.com>
> Message-ID: <cb7co8$2cb$1 at sea.gmane.org>
>
> I have a project where I need to parse html files that are table heavy
> (a calendar, actually), and I thought minidom would be perfect for my
> needs. The problem is that the HTML that I'm trying to parse isn't
> quite valid XML -- mostly minor things, but enough so that minidom
> won't work. Is there something that would convert an html file into
> XML that would work with minidom? Or is there something better, like
> something more geared towards html that I should be looking at?
--
Andrew Shearer
Senior Analyst, Medical Computing
IS Applications Group
Lifespan