Extracting XML from HTML

Paul Boddie paul at boddie.org.uk
Mon Sep 17 17:01:59 EDT 2007


On 17 Sep, 22:31, kyoso... at gmail.com wrote:
>
> What's the best way to get at the XML? Do I need to somehow parse it
> using the HTMLParser and then parse that with minidom or what?

Probably the easiest approach is to use an XML processing toolkit or
library which supports HTML parsing. Since the libxml2 library (written
in C) does a fairly good job of parsing HTML, I would suggest either
libxml2dom (for a DOM-like API) or lxml (for an ElementTree-like API) as
suitable Python wrappers of libxml2. Of course, HTMLParser or SGMLParser
should also work, but the programming style is a bit more convoluted
unless you're used to XML processing with a SAX-like, event-driven API.
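
To give a feel for the SAX-like style, here is a minimal sketch of the
HTMLParser-then-minidom route from the question, using only the standard
library. It assumes, purely for illustration, that the XML sits as escaped
text inside a <pre> element; the class name, function name and sample
document are hypothetical and would need adapting to the real page. The
import shown is the Python 3 location (html.parser); in Python 2 the module
was called HTMLParser.

```python
from html.parser import HTMLParser
from xml.dom import minidom


class EmbeddedXMLExtractor(HTMLParser):
    """Collect the text found inside <pre> elements (assumed location)."""

    def __init__(self):
        # convert_charrefs=True hands us &lt; and &gt; already decoded.
        super().__init__(convert_charrefs=True)
        self._depth = 0   # greater than zero while inside a <pre>
        self.chunks = []  # pieces of text seen inside the element

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def extract_xml(html_text):
    """Return a minidom Document built from the text inside <pre>."""
    parser = EmbeddedXMLExtractor()
    parser.feed(html_text)
    parser.close()
    return minidom.parseString("".join(parser.chunks).strip())


# Hypothetical input: an HTML page carrying escaped XML in a <pre> block.
SAMPLE_HTML = """<html><body>
<p>Report</p>
<pre>&lt;order id="42"&gt;&lt;item&gt;widget&lt;/item&gt;&lt;/order&gt;</pre>
</body></html>"""

doc = extract_xml(SAMPLE_HTML)
root = doc.documentElement
item_text = root.getElementsByTagName("item")[0].firstChild.data
print(root.tagName, root.getAttribute("id"), item_text)
```

The state tracking in the handler methods is the convoluted part: the parser
only reports events, so the code has to remember for itself whether it is
currently inside the element of interest. A tree-based API avoids this by
letting you query the parsed document directly.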

Paul

P.S. I'm biased towards libxml2dom, being the developer, but I use it
routinely and it generally does the job for me.



