[Tutor] finding mismatched or unpaired html tags
Alan Gauld
alan.gauld at btinternet.com
Tue Apr 28 18:20:48 CEST 2009
"Dinesh B Vadhia" <dineshbvadhia at hotmail.com> wrote
> I'm processing tens of thousands of html files and a few of them contain
> mismatched tags and ElementTree throws the error:
>
> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag: line
> 124, column 8"
IMHO the best way to cleanse HTML files is to use tidy.
It is available for *nix and Windows and has a wealth of
options to control its output. It can even convert HTML into
valid XHTML, which ElementTree should be happy with.
http://tidy.sourceforge.net/
It may not be Python but it's fast and effective!
And there is a Python wrapper:
http://utidylib.berlios.de/
although I've never used it.
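
As a rough sketch of how the two steps could fit together (untested
against your files): let ElementTree tell you which files are broken and
where, then hand only those to the tidy command-line tool. The function
names find_parse_error and tidy_to_xhtml are made up for illustration,
and the sketch assumes a "tidy" executable is on your PATH; it does not
use the Python wrapper.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


def find_parse_error(markup):
    """Return None if markup parses as XML, else (line, column, message).

    Hypothetical helper: it only reports the first error expat hits.
    """
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        line, column = err.position  # position is a (line, column) tuple
        return line, column, str(err)
    return None


def tidy_to_xhtml(src_path, dst_path):
    """Ask the tidy command-line tool to write an XHTML copy of src_path.

    Hypothetical helper: returns False if tidy is not installed or if it
    reports errors it could not repair.
    """
    if shutil.which("tidy") is None:
        return False
    result = subprocess.run(
        ["tidy", "-asxhtml", "-numeric", "-quiet", "-o", dst_path, src_path],
        capture_output=True,
    )
    # tidy exits with 0 when clean, 1 for warnings only, 2 for errors
    return result.returncode in (0, 1)


if __name__ == "__main__":
    broken = "<html>\n<body><p>text</body>\n</html>"
    print(find_parse_error(broken))
```

Run find_parse_error over the text of each file first; anything that
comes back non-None is a candidate for tidy_to_xhtml. Note that plain
HTML can also fail for reasons other than mismatched tags (an undefined
entity such as &nbsp;, for instance), so check the message as well as
the position.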
--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/