<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content=text/html;charset=iso-8859-1>
<META content="MSHTML 6.00.6000.16825" name=GENERATOR></HEAD>
<BODY id=MailContainerBody
style="PADDING-RIGHT: 10px; PADDING-LEFT: 10px; PADDING-TOP: 15px" leftMargin=0
topMargin=0 CanvasTabStop="true" name="Compose message area">
<DIV><FONT face=Garamond color=#000080><FONT color=#000000>Stefan / Alan et
al</FONT></FONT></DIV>
<DIV><FONT face=Garamond></FONT> </DIV>
<DIV><FONT face=Garamond>Thank you for all the advice and links. A simple
script using etree is scanning 500K+ xhtml files, and 2 files with mismatched
tags have been found so far, which can be fixed manually. I'll definitely
look into "tidy" as it sounds pretty cool. Because we are running data
processing programs on a 64-bit Windows box (yes, I know, I know ...) using
64-bit Python, we can only use pure-Python libraries, and I believe that
lxml uses C libraries. Again, thanks to everyone - a terrific community as
usual!</FONT></DIV>
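<DIV><FONT face=Garamond>For anyone who wants to try the same approach, here
is a minimal sketch of such a scan. It is not the exact script, only an
illustration that assumes a recent Python 3 and uses nothing beyond the
standard library (xml.etree.ElementTree and os). The names find_parse_error and
scan_tree and the list of file extensions are made up for this example. The
parser stops at the first well-formedness error in a file, so each broken file
is reported once, and errors other than mismatched tags (undefined entities,
for instance) are reported as well.</FONT></DIV>
<PRE>
```python
import os
import xml.etree.ElementTree as ET


def find_parse_error(path):
    """Return (line, column, message) for the first parse error, or None."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple from the parser.
        line, column = err.position
        return line, column, str(err)
    return None


def scan_tree(root_dir, extensions=(".html", ".xhtml")):
    """Yield (path, line, column, message) for each file that fails to parse."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # str.endswith accepts a tuple of suffixes.
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                problem = find_parse_error(path)
                if problem is not None:
                    yield (path,) + problem
```
</PRE>
<DIV><FONT face=Garamond>Looping over scan_tree(some_directory) and printing
each tuple gives one entry per broken file, with a message in the same
"mismatched tag: line N, column M" form as the error quoted
below.</FONT></DIV>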
<DIV><FONT face=Garamond color=#000080><FONT
color=#000000></FONT></FONT> </DIV>
<DIV><FONT face=Garamond color=#000080><FONT color=#000000><FONT
color=#000080></FONT> </DIV>
<DIV>
<HR>
</DIV>
<DIV>Message: 5<BR>Date: Tue, 28 Apr 2009 19:39:17 +0200<BR>From: Stefan Behnel
&lt;</FONT><A title="mailto:stefan_ml@behnel.de CTRL + Click to follow link"
href="mailto:stefan_ml@behnel.de">stefan_ml@behnel.de</A><FONT
color=#000000>><BR>Subject: Re: [Tutor] finding mismatched or unpaired html
tags<BR>To: </FONT><A
title="mailto:tutor@python.org CTRL + Click to follow link"
href="mailto:tutor@python.org">tutor@python.org</A><BR><FONT
color=#000000>Message-ID: &lt;</FONT><A
href="mailto:gt7f05$1ov$1@ger.gmane.org">gt7f05$1ov$1@ger.gmane.org</A><FONT
color=#000000>><BR>Content-Type: text/plain;
charset=ISO-8859-1<BR><BR>A.T.Hofkamp wrote:<BR>> Dinesh B Vadhia
wrote:<BR>>> I'm processing tens of thousands of html files and a few of
them<BR>>> contain mismatched tags and ElementTree throws the
error:<BR>>><BR>>> "Unexpected error opening
J:/F2/663/blahblah.html: mismatched tag:<BR>>> line 124, column
8"<BR>>><BR>>> I now want to scan each file and simply identify each
mismatched or<BR>>> unpaired tags (by line number) in each file.
I've read the ElementTree docs and<BR>>> cannot see anything obvious
how to do this. I know this is a common problem but<BR>>> feeling a bit
clueless here - any ideas?<BR>> <BR>> Don't use elementTree, use
BeautifulSoup instead.<BR><BR>Actually, now that the code is there anyway, the
OP might be happier with<BR>lxml.html. It's a lot faster than BeautifulSoup,
uses less memory, and<BR>often parses broken HTML better. It's also more user
friendly for many HTML<BR>tasks.<BR><BR></FONT><A
href="http://codespeak.net/lxml/lxmlhtml.html">http://codespeak.net/lxml/lxmlhtml.html</A><BR><BR><FONT
color=#000000>This might also be worth a read:<BR><BR></FONT><A
href="http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/">http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/</A><BR><BR><FONT
color=#000000>Stefan</FONT><BR></DIV></FONT></BODY></HTML>