<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content=text/html;charset=iso-8859-1>
<META content="MSHTML 6.00.6000.16825" name=GENERATOR></HEAD>
<BODY id=MailContainerBody
style="PADDING-RIGHT: 10px; PADDING-LEFT: 10px; PADDING-TOP: 15px" leftMargin=0
topMargin=0 CanvasTabStop="true" name="Compose message area">
<DIV><FONT face=Garamond color=#000080><FONT color=#000000>A.T. /
Marty</FONT></FONT></DIV>
<DIV><FONT face=Garamond></FONT> </DIV>
<DIV><FONT face=Garamond>I'd prefer that the HTML parser not replace the
missing tags, as I want to know where and what the problems are. Also, the
source HTML documents were generated by another computer, i.e. they are not web
page documents. My sense is that only a few files out of tens of
thousands are affected. Cheers ...</FONT></DIV>
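<DIV><FONT face=Garamond>One way to report the problems without repairing
anything might be the standard library's html.parser, which exposes the current
position through getpos(). The following is only a minimal sketch, assuming
Python 3 and assuming that every non-void tag in these machine-generated files
is meant to have an explicit closing tag. TagChecker and find_tag_problems are
hypothetical names used for illustration, not part of any library.</FONT></DIV>
<PRE>
```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagChecker(HTMLParser):
    """Hypothetical helper: records tag problems, never modifies the input."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of (tag, line, column) for tags still open
        self.problems = []    # (line, column, message) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            line, col = self.getpos()
            self.open_tags.append((tag, line, col))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        line, col = self.getpos()
        names = [name for name, _, _ in self.open_tags]
        if tag not in names:
            self.problems.append(
                (line, col, "closing tag %r has no opening tag" % tag))
            return
        # Anything opened after the matching tag was never closed.
        while self.open_tags:
            name, open_line, open_col = self.open_tags.pop()
            if name == tag:
                break
            self.problems.append(
                (open_line, open_col, "tag %r is never closed" % name))


def find_tag_problems(text):
    """Return a sorted list of (line, column, message) for one document."""
    checker = TagChecker()
    checker.feed(text)
    checker.close()
    # Whatever is still on the stack at end of input was never closed.
    for name, line, col in checker.open_tags:
        checker.problems.append((line, col, "tag %r is never closed" % name))
    return sorted(checker.problems)
```
</PRE>
<DIV><FONT face=Garamond>Calling find_tag_problems on the text of each file
would give a list of (line, column, message) tuples, with an empty list meaning
every tag paired up. Lines are counted from 1 and columns from 0. Tags whose
closing tag is optional in HTML would be reported as unclosed under this strict
assumption.</FONT></DIV>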
<DIV><FONT face=Garamond></FONT> </DIV>
<DIV><FONT face=Garamond>Dinesh</FONT></DIV>
<DIV><FONT face=Garamond color=#000080><FONT
color=#000000></FONT></FONT> </DIV>
<DIV><FONT face=Garamond color=#000080><FONT color=#000000><FONT
color=#000080></FONT> </DIV>
<DIV>
<HR>
</DIV>
<DIV>Message: 7<BR>Date: Tue, 28 Apr 2009 08:54:33 -0500<BR>From: Martin Walsh
<</FONT><A href="">mwalsh@mwalsh.org</A><FONT color=#000000>><BR>Subject:
Re: [Tutor] finding mismatched or unpaired html tags<BR>To: "</FONT><A
href="">tutor@python.org</A><FONT color=#000000>" <</FONT><A
href="">tutor@python.org</A><FONT color=#000000>><BR>Message-ID:
<</FONT><A href="">49F70A99.3050002@mwalsh.org</A><FONT
color=#000000>><BR>Content-Type: text/plain;
charset=us-ascii<BR><BR>A.T.Hofkamp wrote:<BR>> Dinesh B Vadhia
wrote:<BR>>> I'm processing tens of thousands of html files and a few of
them<BR>>> contain mismatched tags and ElementTree throws the
error:<BR>>><BR>>> "Unexpected error opening
J:/F2/663/blahblah.html: mismatched tag:<BR>>> line 124, column
8"<BR>>><BR>>> I now want to scan each file and simply identify each
mismatched or unpaired<BR>>> tags (by line number) in each file.
I've read the ElementTree docs and cannot<BR>>> see anything obvious
how to do this. I know this is a common problem but<BR>>> feeling a bit
clueless here - any ideas?<BR>>><BR>> <BR>> Don't use ElementTree,
use BeautifulSoup instead.<BR>> <BR>> ElementTree expects perfect input,
typically generated by another computer.<BR>> BeautifulSoup is designed to
handle your everyday HTML page, filled with<BR>> errors of all possible
kinds.<BR><BR>But it also modifies the source HTML by default, adding closing
tags,<BR>etc. Important to know, I suppose, if you intend to rewrite the
HTML<BR>files you parse with BeautifulSoup.<BR><BR>Also, unless you're running
Python 3.0 or greater, use the 3.0.x series<BR>of BeautifulSoup -- otherwise you
may run into the same issue.<BR><BR></FONT><A
href="">http://www.crummy.com/software/BeautifulSoup/3.1-problems.html</A><BR><BR><FONT
color=#000000>HTH,<BR>Marty</FONT><BR><BR></DIV></FONT></BODY></HTML>