[Tutor] finding mismatched or unpaired html tags
mwalsh at mwalsh.org
Tue Apr 28 15:54:33 CEST 2009
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>> I now want to scan each file and simply identify each mismatched or
>> unpaired tag (by line number) in each file. I've read the ElementTree docs and
>> don't see anything obvious how to do this. I know this is a common problem but
>> I'm feeling a bit clueless here - any ideas?
> Don't use ElementTree, use BeautifulSoup instead.
> ElementTree expects perfect input, typically generated by another computer.
> BeautifulSoup is designed to handle your everyday HTML page, filled with
> errors of all possible kinds.
But it also modifies the source html by default, adding closing tags,
etc. Important to know, I suppose, if you intend to re-write the html
files you parse with BeautifulSoup.
Also, unless you're running Python 3.0 or greater, use the 3.0.x series
of BeautifulSoup. The 3.1.x series is built on a stricter parser, so you
may run into the same mismatched-tag errors on badly formed pages.
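
As for the original question (listing mismatched or unpaired tags by line
number), here is a minimal sketch using only the standard library. It assumes
Python 3, where the module is html.parser, and the names TagChecker,
find_tag_problems and VOID_TAGS are made up for illustration. It keeps a stack
of open tags and reports anything that does not pair up. Treat it as a rough
starting point, not a validator.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagChecker(HTMLParser):
    """Hypothetical helper: collects (line, message) for each tag problem."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags as (name, line) pairs
        self.problems = []  # (line, message) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            # getpos() returns (line, column) of the tag being handled.
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        line = self.getpos()[0]
        if tag not in [name for name, _ in self.stack]:
            self.problems.append(
                (line, "closing </%s> has no opening tag" % tag))
            return
        # Anything opened after the matching tag was never closed.
        while self.stack:
            name, opened = self.stack.pop()
            if name == tag:
                break
            self.problems.append((opened, "<%s> is never closed" % name))


def find_tag_problems(text):
    """Return a sorted list of (line, message) pairs for the given HTML."""
    checker = TagChecker()
    checker.feed(text)
    checker.close()
    # Tags still open at the end of the input are unpaired too.
    for name, opened in checker.stack:
        checker.problems.append((opened, "<%s> is never closed" % name))
    return sorted(checker.problems)
```

Calling find_tag_problems() on the text of each file and printing the pairs
gives a per-file report. Note that this sketch also flags tags that HTML allows
you to leave open (p, li and so on), so expect some noise on real pages.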