<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML><HEAD>

<META http-equiv=Content-Type content=text/html;charset=iso-8859-1>

<META content="MSHTML 6.00.6000.16825" name=GENERATOR></HEAD>

<BODY id=MailContainerBody 

style="PADDING-RIGHT: 10px; PADDING-LEFT: 10px; PADDING-TOP: 15px" leftMargin=0 

topMargin=0 CanvasTabStop="true" name="Compose message area">

<DIV><FONT face=Garamond color=#000080><FONT color=#000000>A.T. / 

Marty</FONT></FONT></DIV>

<DIV><FONT face=Garamond></FONT>&nbsp;</DIV>

<DIV><FONT face=Garamond>I'd prefer that the HTML parser not fill in the 

missing tags, because I want to know where and what the problems are.&nbsp; Also, the 

source HTML documents were generated by another computer, i.e. they are not web 

pages.&nbsp; My sense is that only a few files out of tens of 

thousands are affected.&nbsp; Cheers ...</FONT></DIV>
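
<DIV><FONT face=Garamond>One way to report the problems without repairing them is to keep a stack of open tags while a non-repairing tokenizer walks the file. What follows is only a minimal sketch under stated assumptions: it uses the html.parser module from the Python 3 standard library (not ElementTree or BeautifulSoup), the names VOID_TAGS, TagChecker and find_tag_problems are hypothetical, and it reports every tag left open, including ones HTML allows to stay open (such as P or LI), which may be stricter than wanted.</FONT></DIV>

<PRE>
```python
from html.parser import HTMLParser

# Hypothetical helper set: void elements never take a closing tag,
# so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagChecker(HTMLParser):
    """Record mismatched or unpaired tags; the input is never modified."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line, column) for open tags
        self.problems = []   # list of (line, column, message)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling the handlers.
        if tag not in VOID_TAGS:
            line, col = self.getpos()
            self.open_tags.append((tag, line, col))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        line, col = self.getpos()
        open_names = [name for name, _, _ in self.open_tags]
        if tag not in open_names:
            self.problems.append(
                (line, col, "closing tag " + tag + " was never opened"))
            return
        # Pop back to the matching opener; every tag skipped on the way
        # was opened but not closed before its parent ended.
        while self.open_tags:
            name, open_line, open_col = self.open_tags.pop()
            if name == tag:
                break
            self.problems.append(
                (open_line, open_col, "unclosed tag " + name))


def find_tag_problems(html_text):
    """Return sorted (line, column, message) tuples for one document."""
    checker = TagChecker()
    checker.feed(html_text)
    checker.close()
    # Anything still on the stack at end of input was never closed.
    for name, line, col in checker.open_tags:
        checker.problems.append((line, col, "unclosed tag " + name))
    return sorted(checker.problems)
```
</PRE>

<DIV><FONT face=Garamond>Calling find_tag_problems on the text of each file and logging the file name whenever the result is non-empty would give a per-file list of line numbers. Running it only on the files that ElementTree has already rejected keeps the extra pass small.</FONT></DIV>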

<DIV><FONT face=Garamond></FONT>&nbsp;</DIV>

<DIV><FONT face=Garamond>Dinesh</FONT></DIV>

<DIV><FONT face=Garamond color=#000080><FONT 

color=#000000></FONT></FONT>&nbsp;</DIV>

<DIV><FONT face=Garamond color=#000080><FONT color=#000000><FONT 

color=#000080></FONT>&nbsp;</DIV>

<DIV>

<HR>

</DIV>

<DIV>Message: 7<BR>Date: Tue, 28 Apr 2009 08:54:33 -0500<BR>From: Martin Walsh 

&lt;</FONT><A href="">mwalsh@mwalsh.org</A><FONT color=#000000>&gt;<BR>Subject: 

Re: [Tutor] finding mismatched or unpaired html tags<BR>To: "</FONT><A 

href="">tutor@python.org</A><FONT color=#000000>" &lt;</FONT><A 

href="">tutor@python.org</A><FONT color=#000000>&gt;<BR>Message-ID: 

&lt;</FONT><A href="">49F70A99.3050002@mwalsh.org</A><FONT 

color=#000000>&gt;<BR>Content-Type: text/plain; 

charset=us-ascii<BR><BR>A.T.Hofkamp wrote:<BR>&gt; Dinesh B Vadhia 

wrote:<BR>&gt;&gt; I'm processing tens of thousands of HTML files and a few of 

them<BR>&gt;&gt; contain mismatched tags and ElementTree throws the 

error:<BR>&gt;&gt;<BR>&gt;&gt; "Unexpected error opening 

J:/F2/663/blahblah.html: mismatched tag:<BR>&gt;&gt; line 124, column 

8"<BR>&gt;&gt;<BR>&gt;&gt; I now want to scan each file and simply identify each 

mismatched or<BR>&gt;&gt; unpaired tag (by line number) in each file. I've read the 

ElementTree docs and<BR>&gt;&gt; cannot see anything obvious about how to do this. I 

know this is a common problem but<BR>&gt;&gt; I'm feeling a bit clueless here - any 

ideas?<BR>&gt;&gt;<BR>&gt; <BR>&gt; Don't use ElementTree, 

use BeautifulSoup instead.<BR>&gt; <BR>&gt; ElementTree expects perfect input, 

typically generated by another computer.<BR>&gt; BeautifulSoup is designed to 

handle your everyday HTML page, filled with<BR>&gt; errors of all possible 

kinds.<BR><BR>But it also modifies the source HTML by default, adding closing 

tags,<BR>etc. Important to know, I suppose, if you intend to rewrite the 

HTML<BR>files you parse with BeautifulSoup.<BR><BR>Also, unless you're running 

Python 3.0 or greater, use the 3.0.x series<BR>of BeautifulSoup -- otherwise you 

may run into the same issue.<BR><BR></FONT><A 

href="">http://www.crummy.com/software/BeautifulSoup/3.1-problems.html</A><BR><BR><FONT 

color=#000000>HTH,<BR>Marty</FONT><BR><BR></DIV></FONT></BODY></HTML>