[Tutor] finding mismatched or unpaired html tags
stefan_ml at behnel.de
Tue Apr 28 19:39:17 CEST 2009
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>> I now want to scan each file and simply identify each mismatched or
>> unpaired tags (by line number) in each file. I've read the ElementTree docs and
>> don't see anything obvious how to do this. I know this is a common problem but
>> I'm feeling a bit clueless here - any ideas?
> Don't use ElementTree, use BeautifulSoup instead.
Actually, now that the code is there anyway, the OP might be happier with
lxml.html. It's a lot faster than BeautifulSoup, uses less memory, and
often parses broken HTML better. It's also more user friendly for many HTML
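
If the goal is only to list the offending tags by line number, a rough
sketch along the following lines might be enough. Note that it uses the
standard library's html.parser (Python 3) rather than lxml or
BeautifulSoup, simply so that it runs without extra installs. TagChecker
and find_tag_problems are made-up names for illustration, and the check is
deliberately strict in the XML sense, so HTML's optional end tags (</p>,
</li> and friends) get reported as well.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagChecker(HTMLParser):
    """Hypothetical helper: collects mismatched or unclosed tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, line, column) for every tag still open
        self.problems = []   # (line, column, message)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            line, col = self.getpos()
            self.stack.append((tag, line, col))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        line, col = self.getpos()
        if tag not in [name for name, _, _ in self.stack]:
            self.problems.append(
                (line, col, "</%s> has no matching open tag" % tag))
            return
        # Pop back to the innermost matching open tag; everything
        # sitting above it on the stack was left unclosed.
        while self.stack:
            name, oline, ocol = self.stack.pop()
            if name == tag:
                break
            self.problems.append(
                (oline, ocol,
                 "<%s> not closed before </%s> on line %d" % (name, tag, line)))

    def close(self):
        super().close()
        # Whatever is still open at end of input was never closed.
        for name, line, col in self.stack:
            self.problems.append((line, col, "<%s> never closed" % name))
        self.stack = []


def find_tag_problems(text):
    """Return a sorted list of (line, column, message) tuples."""
    checker = TagChecker()
    checker.feed(text)
    checker.close()
    return sorted(checker.problems)
```

Something like find_tag_problems(open(path, encoding="utf-8").read()) per
file would then give a list of (line, column, message) tuples, with lines
counted from 1 and columns from 0. The encoding here is an assumption,
adjust it to whatever the files actually use.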
This might also be worth a read: