[Tutor] finding mismatched or unpaired html tags

Martin Walsh mwalsh at mwalsh.org
Tue Apr 28 15:54:33 CEST 2009


A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired
> tags (by line number) in each file. I've read the ElementTree docs and
> cannot
> see anything obvious how to do this. I know this is a common problem but
> feeling a bit clueless here - any ideas?
>>
> 
> Don't use elementTree, use BeautifulSoup instead.
> 
> elementTree expects perfect input, typically generated by another computer.
> BeautifulSoup is designed to handle your everyday HTML page, filled with
> errors of all possible kinds.

But it also modifies the source html by default, adding closing tags,
etc. Important to know, I suppose, if you intend to re-write the html
files you parse with BeautifulSoup.

Also, unless you're running python 3.0 or greater, use the 3.0.x series
of BeautifulSoup -- otherwise you may run into the same issue.

http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

HTH,
Marty







More information about the Tutor mailing list