Extracting xml from html

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Mon Sep 17 23:51:23 CEST 2007

On Mon, 17 Sep 2007 17:31:19 -0300, <kyosohma at gmail.com> wrote:

> I am attempting to extract some XML from an HTML document that I get
> returned from a form based web page. For some reason, I cannot figure
> out how to do this. I thought I could use the minidom module to do it,
> but all I get is a screwy traceback:
> Traceback (most recent call last):
>   File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
>     parser.Parse(buffer, 0)
> ExpatError: mismatched tag: line 1, column 357

So your HTML is not a well-formed XML document (many HTML pages aren't), so
you can't use an XML parser on it. Even a valid HTML document may not be
well-formed XML. Let's try with some mismatched tags:

py> text = '''<html>
... <body>
... <p>lots of <div>screwy text including divs and <span>spans</p>
... <Row status="o">
...     <RecordNum>1126264</RecordNum>
...     <Make>Mitsubishi</Make>
...     <Model>Mirage DE</Model>
... </Row>
... </body>
... </html>'''
py> import xml.dom.minidom
py> doc = xml.dom.minidom.parseString(text)
Traceback (most recent call last):
xml.parsers.expat.ExpatError: mismatched tag: line 3, column 60
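
If you want to decide in code whether a document can go to an XML parser at
all, one option is to catch the ExpatError instead of letting it propagate.
This is only a minimal sketch; the helper name is_well_formed is made up for
illustration and is not part of minidom.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(text):
    """Return True if text parses as XML, False if expat rejects it.

    Hypothetical helper: it only reports whether minidom could parse
    the input, not where or why the parse failed.
    """
    try:
        xml.dom.minidom.parseString(text)
    except ExpatError:
        return False
    return True


# Mismatched tags, like the <p>...<div>...<span>...</p> line above
print(is_well_formed('<p>lots of <div>text and <span>spans</p>'))
# The <Row> block on its own has properly nested tags
print(is_well_formed('<Row status="o"><Make>Mitsubishi</Make></Row>'))
```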

You will need a more robust parser, like BeautifulSoup:

py> from BeautifulSoup import BeautifulSoup
py> soup = BeautifulSoup(text)
py> for row in soup.findAll("row"):
...   print row.recordnum, row.make.contents, row.model.string
<recordnum>1126264</recordnum> [u'Mitsubishi'] Mirage DE

Depending on your document, you may prefer to extract the XML blocks using
BeautifulSoup, and then parse each one using BeautifulStoneSoup (the XML
parser) or xml.etree.ElementTree.
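
A rough sketch of that two-step idea, using only the standard library: here
a regular expression stands in for BeautifulSoup in the extraction step, and
xml.etree.ElementTree parses each block. It assumes every XML block is a
<Row> element that is itself well formed and never nested inside another
<Row>; the function name extract_rows is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

text = '''<html>
<body>
<p>lots of <div>screwy text including divs and <span>spans</p>
<Row status="o">
    <RecordNum>1126264</RecordNum>
    <Make>Mitsubishi</Make>
    <Model>Mirage DE</Model>
</Row>
</body>
</html>'''

# Assumption: blocks look like <Row ...>...</Row> and are not nested.
ROW_RE = re.compile(r'<Row\b[^>]*>.*?</Row>', re.DOTALL)


def extract_rows(html):
    """Return one dict per <Row> block found in html (hypothetical helper)."""
    rows = []
    for match in ROW_RE.finditer(html):
        # Each matched block is well-formed XML even though the page is not.
        elem = ET.fromstring(match.group(0))
        record = dict((child.tag, child.text) for child in elem)
        record['status'] = elem.get('status')
        rows.append(record)
    return rows


rows = extract_rows(text)
print(rows)
```

Unlike the BeautifulSoup session above, ElementTree keeps the tag names in
their original case (RecordNum, not recordnum). The regular expression is
the fragile part: if the <Row> blocks can be nested or malformed themselves,
extracting them with BeautifulSoup is the safer choice.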

Gabriel Genellina
