Extracting xml from html

kyosohma at gmail.com kyosohma at gmail.com
Mon Sep 17 22:31:19 CEST 2007


I am attempting to extract some XML from an HTML document that I get
returned from a form based web page. For some reason, I cannot figure
out how to do this. I thought I could use the minidom module to do it,
but all I get is a screwy traceback:

Traceback (most recent call last):
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy
\xml_parser.py", line 69, in ?
    inst = ApptParser(url)
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy
\xml_parser.py", line 19, in __init__
    xml = self.getXml(url)
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy
\xml_parser.py", line 30, in getXml
    doc = xml.dom.minidom.parse(f)
  File "C:\Python24\lib\xml\dom\minidom.py", line 1915, in parse
    return expatbuilder.parse(file)
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in
    parser.Parse(buffer, 0)
ExpatError: mismatched tag: line 1, column 357

Here's a sample of the html:

lots of screwy text including divs and spans
<Row status="o">
    <Model>Mirage DE</Model>

What's the best way to get at the XML? Do I need to somehow parse it
using the HTMLParser and then parse that with minidom or what?

Thanks a lot!


More information about the Python-list mailing list