Extracting xml from html
kyosohma at gmail.com
Tue Sep 18 15:33:40 EDT 2007
On Sep 18, 1:56 am, Stefan Behnel <stefan.behnel-n05... at web.de> wrote:
> kyoso... at gmail.com wrote:
> > I am attempting to extract some XML from an HTML document that I get
> > returned from a form based web page. For some reason, I cannot figure
> > out how to do this.
> > Here's a sample of the html:
>
> > <html>
> > <body>
> > lots of screwy text including divs and spans
> > <Row status="o">
> > <RecordNum>1126264</RecordNum>
> > <Make>Mitsubishi</Make>
> > <Model>Mirage DE</Model>
> > </Row>
> > </body>
> > </html>
>
> > What's the best way to get at the XML? Do I need to somehow parse it
> > using the HTMLParser and then parse that with minidom or what?
>
> lxml makes this pretty easy:
>
> >>> from lxml import etree
> >>> parser = etree.HTMLParser()
> >>> tree = etree.parse(the_file_or_url, parser)
>
> This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
> tree iteration, ... You will also get plain XML when you serialise it to XML:
>
> >>> xml_string = etree.tostring(tree)
>
> Note that this doesn't add any namespaces, so you will not magically get valid
> XHTML or something. You could rewrite the tags by hand, though.
>
> Stefan
I got it to work with lxml. See below:

    from io import BytesIO
    from lxml import etree

    def Parser(filename):
        # The HTML parser copes with the messy markup; tag names come back lowercase.
        parser = etree.HTMLParser()
        tree = etree.parse(filename, parser)  # e.g. r'path/to/nextpage.htm'
        # Serialise the tree to XML, then walk the result element by element.
        xml_string = etree.tostring(tree)
        owner = address = None
        for action, elem in etree.iterparse(BytesIO(xml_string)):
            tag = elem.tag
            if tag == 'primaryowner':
                owner = elem.text
            elif tag == 'customeraddress':
                address = elem.text
        print('Primary Owner: %s' % owner)
        print('Address: %s' % address)

Does this make sense? It works pretty well, but I don't really
understand everything that I'm doing.
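
For comparison, here is a minimal sketch of the same idea without the serialise-and-reparse round trip, since the tree that the HTML parser returns can be searched directly. It runs against the sample HTML from the original question. The names SAMPLE_HTML and extract_fields are made up for illustration, and the tag names are written in lowercase on the assumption that the HTML parser lowercases them, which is why they do not match the mixed-case names in the page.

```python
from lxml import etree

# Sample page from the original question, embedded so the sketch is self-contained.
SAMPLE_HTML = """<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>"""


def extract_fields(html_text, wanted=("recordnum", "make", "model")):
    """Return {tag: text} for the wanted tags (hypothetical helper)."""
    parser = etree.HTMLParser()
    root = etree.fromstring(html_text, parser)
    fields = {}
    # iter() walks the whole tree and yields only elements with these tag names.
    for elem in root.iter(*wanted):
        fields[elem.tag] = elem.text
    return fields


if __name__ == "__main__":
    for tag, text in extract_fields(SAMPLE_HTML).items():
        print("%s: %s" % (tag, text))
```

To read from a file instead, etree.parse(filename, parser).getroot() could stand in for the fromstring call. Whether this is any better than the iterparse version is a matter of taste; it mainly saves building the XML string and parsing it a second time.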
Mike