<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content=text/html;charset=iso-8859-1>

<META content="MSHTML 6.00.6000.16825" name=GENERATOR></HEAD>

<BODY id=MailContainerBody 

style="PADDING-RIGHT: 10px; PADDING-LEFT: 10px; PADDING-TOP: 15px" leftMargin=0 

topMargin=0 CanvasTabStop="true" name="Compose message area">

<DIV><FONT face=Garamond color=#000080><FONT color=#000000>Stefan / Alan et 

al</FONT></FONT></DIV>

<DIV><FONT face=Garamond></FONT>&nbsp;</DIV>

<DIV><FONT face=Garamond>Thank you for all the advice and links.&nbsp; A simple 

script using etree is scanning 500K+ xhtml files, and so far it has found 2 files 

with mismatched tags, which can be fixed manually.&nbsp; I'll definitely 

look into "tidy" as it sounds pretty cool.&nbsp; Because we are running data 

processing programs on a 64-bit Windows box (yes, I know, I know ...) using 

64-bit Python, we can only use pure-Python libraries.&nbsp; I believe that 

lxml uses C libraries.&nbsp; Again, thanks to everyone - a terrific community as 

usual!</FONT></DIV>
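<DIV><FONT face=Garamond></FONT>&nbsp;</DIV>

<DIV><FONT face=Garamond>For anyone following along, here is a minimal sketch of 

the kind of scan described above. It is not the exact script: it assumes a recent 

Python 3, where the standard-library ElementTree raises ParseError with a 

position attribute, and the names find_parse_error and scan_tree are made up for 

illustration.</FONT></DIV>

<PRE>
```python
import os
import xml.etree.ElementTree as ET


def find_parse_error(path):
    """Return (line, column, message) for the first parse error, or None.

    Hypothetical helper: the underlying expat parser stops at the first
    problem, so at most one error per file is reported.
    """
    try:
        ET.parse(path)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


def scan_tree(root_dir, extensions=(".html", ".xhtml")):
    """Yield (path, line, column, message) for each file that fails to parse.

    The extension list is an assumption; adjust it to the files on disk.
    """
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                problem = find_parse_error(path)
                if problem is not None:
                    yield (path,) + problem
```
</PRE>

<DIV><FONT face=Garamond>Looping over scan_tree on the top-level data directory 

and printing each tuple gives one line per broken file. Since only the first 

error in a file is reported, a file may need to be re-scanned after each manual 

fix.</FONT></DIV>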

<DIV><FONT face=Garamond color=#000080><FONT 

color=#000000></FONT></FONT>&nbsp;</DIV>

<DIV><FONT face=Garamond color=#000080><FONT color=#000000><FONT 

color=#000080></FONT>&nbsp;</DIV>

<DIV>

<HR>

</DIV>

<DIV>Message: 5<BR>Date: Tue, 28 Apr 2009 19:39:17 +0200<BR>From: Stefan Behnel 

&lt;</FONT><A 

href="mailto:stefan_ml@behnel.de">stefan_ml@behnel.de</A><FONT 

color=#000000>&gt;<BR>Subject: Re: [Tutor] finding mismatched or unpaired html 

tags<BR>To: </FONT><A 


href="mailto:tutor@python.org">tutor@python.org</A><BR><FONT 

color=#000000>Message-ID: &lt;</FONT><A 

href="mailto:gt7f05$1ov$1@ger.gmane.org">gt7f05$1ov$1@ger.gmane.org</A><FONT 

color=#000000>&gt;<BR>Content-Type: text/plain; 

charset=ISO-8859-1<BR><BR>A.T.Hofkamp wrote:<BR>&gt; Dinesh B Vadhia 

wrote:<BR>&gt;&gt; I'm processing tens of thousands of html files and a few of 

them<BR>&gt;&gt; contain mismatched tags and ElementTree throws the 

error:<BR>&gt;&gt;<BR>&gt;&gt; "Unexpected error opening 

J:/F2/663/blahblah.html: mismatched tag:<BR>&gt;&gt; line 124, column 

8"<BR>&gt;&gt;<BR>&gt;&gt; I now want to scan each file and simply identify each 

mismatched or<BR>&gt;&gt; unpaired tags (by line number) in each file. 

I've read the ElementTree docs and<BR>&gt;&gt; cannot see anything obvious 

how to do this. I know this is a common problem but<BR>&gt;&gt; feeling a bit 

clueless here - any ideas?<BR>&gt; <BR>&gt; Don't use elementTree, use 

BeautifulSoup instead.<BR><BR>Actually, now that the code is there anyway, the 

OP might be happier with<BR>lxml.html. It's a lot faster than BeautifulSoup, 

uses less memory, and<BR>often parses broken HTML better. It's also more user 

friendly for many HTML<BR>tasks.<BR><BR></FONT><A 

href="http://codespeak.net/lxml/lxmlhtml.html">http://codespeak.net/lxml/lxmlhtml.html</A><BR><BR><FONT 

color=#000000>This might also be worth a read:<BR><BR></FONT><A 

href="http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/">http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/</A><BR><BR><FONT 

color=#000000>Stefan</FONT><BR></DIV></FONT></BODY></HTML>