Noob trying to parse bad HTML using xml.etree.ElementTree
Peter Otten
__peter__ at web.de
Sun Dec 30 05:18:24 EST 2012
Morten Guldager wrote:
> Aloha Friends!
>
> I'm trying to process some HTML using xml.etree.ElementTree.
> Problem is that the HTML I'm trying to read has some tags that are not
> properly closed, such as the <img> shown in line 8 below.
>
> 1 from xml.etree import ElementTree
> 2
> 3 tree = ElementTree
> 4 e = tree.fromstring(
> 5 """
> 6 <html>
> 7 <body>
> 8 <img src='mogul.jpg'>
> 9 </body>
> 10 </html>
> 11 """)
>
> Python whines: xml.etree.ElementTree.ParseError: mismatched tag: line 5,
> column 14
>
> I definitely do want to work DOM style, having the whole shebang loaded
> into a nice structure before I start the real work.
>
> Question is if it's possible to tweak xml.etree.ElementTree to accept, and
> understand sloppy html, or if you have suggestions for similar easy to use
> framework, preferably among the included batteries?
In HTML, the <img> tag has no closing counterpart. XML requires every
element to be closed, so valid HTML is not necessarily well-formed XML, and
you cannot rely on xml.etree to parse HTML.
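
To make the difference concrete, here is a minimal sketch using only the
standard library. The two sample strings and the variable names are made up
for illustration; the only change between them is the self-closing slash on
<img>.

```python
from xml.etree import ElementTree

# HTML style: <img> is left open, which is fine in HTML but not in XML.
HTML_STYLE = "<html><body><img src='mogul.jpg'></body></html>"
# XML style: the same document with <img> explicitly self-closed.
XML_STYLE = "<html><body><img src='mogul.jpg'/></body></html>"

try:
    ElementTree.fromstring(HTML_STYLE)
    html_error = None
except ElementTree.ParseError as exc:
    # The parser reaches </body> while <img> is still open.
    html_error = exc

# The self-closed variant is well-formed XML and parses without complaint.
root = ElementTree.fromstring(XML_STYLE)
img_src = root.find("body/img").get("src")

print("HTML-style input:", html_error)
print("XML-style input, img src:", img_src)
```

So xml.etree is fine as long as the input happens to be well-formed XML, but
that is not something you can count on for HTML found in the wild.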
Although it is not in the standard library, lxml is a good alternative for
XML that can deal with HTML, too. See <http://lxml.de/lxmlhtml.html>.
It also provides a way to cope with really broken HTML, by using
BeautifulSoup as a parser backend (lxml.html.soupparser).
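
As a rough sketch of what that looks like for the document in the question,
assuming lxml is installed (it is a third-party package, for example via
"pip install lxml"); the SLOPPY and sources names are just for illustration:

```python
# Requires the third-party lxml package; it is not part of the stdlib.
import lxml.html

# The document from the question, including the unclosed <img> tag.
SLOPPY = """
<html>
<body>
<img src='mogul.jpg'>
</body>
</html>
"""

# lxml's HTML parser knows that <img> never has a closing tag.
root = lxml.html.fromstring(SLOPPY)

# The result supports the familiar ElementTree-style API.
sources = [img.get("src") for img in root.iter("img")]

print(root.tag, sources)
```

The whole document ends up in one tree, so you can work DOM style with the
same find/iter/get calls you would use with xml.etree.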