[XML-SIG] xml.dom.ext.reader.HtmlLib memory leak?
Chuck Bearden
cbearden at hal-pc.org
Wed Aug 25 22:56:39 CEST 2004
On Mon, Aug 23, 2004 at 10:31:11AM -0600, Uche Ogbuji wrote:
>
> Honestly, I don't think DOM is the way I would personally go about
> processing HTML, which is why I was trying to get at whether there was
> another way for you to meet your needs.
I think I understand what you are getting at, but I have found
twisted.web.microdom with 'beExtremelyLenient=True', sometimes preceded
by an mx.Tidy cleanup pass, invaluable for mining data from
database-generated web pages built with crappy HTML. Consider the pages
displaying individual patent records at the USPTO, e.g. [1]. If you
need to treat such pages as if they were XML records to be parsed and
loaded into a database, something like twisted.web.microdom is a big
help.
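
As a rough illustration of the idea (lenient parsing of sloppy HTML into
record-like rows), here is a minimal sketch. It deliberately uses the
standard library's html.parser as a stand-in rather than
twisted.web.microdom or mx.Tidy, so it does not show the microdom API.
The TableRows class, the extract_rows helper, and the SLOPPY sample
markup are all invented for this example and are not taken from the
USPTO pages.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text per table row, tolerating unclosed <tr>/<td> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being built, or None
        self._cell = None   # text chunks of the cell being built, or None

    def _end_cell(self):
        # Close the open cell, if any, and normalize its whitespace.
        if self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        # Close the open row, if any, keeping only non-empty rows.
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        # A new <tr> or <td> implicitly closes the previous one.
        if tag == "tr":
            self._end_row()
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()


def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample: uppercase tags, no closing </TD> or </TR>.
SLOPPY = """<TABLE>
<TR><TD>Inventors:<TD><B>Doe; Jane</B>
<TR><TD>Filed:<TD>January 1, 2000
</TABLE>"""

rows = extract_rows(SLOPPY)
```

Each row could then be mapped to database columns. A lenient DOM such as
microdom would additionally give you a tree to navigate, which this
event-based sketch does not.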
Chuck Bearden
[1] http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6295859.WKU.&OS=PN/6295859&RS=PN/6295859