[Tutor] finding mismatched or unpaired html tags

Alan Gauld alan.gauld at btinternet.com
Tue Apr 28 18:20:48 CEST 2009


"Dinesh B Vadhia" <dineshbvadhia at hotmail.com> wrote

> I'm processing tens of thousands of html files and a few of them contain
> mismatched tags and ElementTree throws the error:
>
> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag: line 
> 124, column 8"

IMHO the best way to cleanse HTML files is to use tidy.
It is available for *nix and Windows and has a wealth of
options to control its output. It can even convert HTML into
valid XHTML, which ElementTree should be happy with.
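
As a rough sketch of how that could look from Python (not something
from the original post): the code below tries ElementTree first and
only calls tidy for files that fail to parse. It assumes a tidy
executable is on the PATH; the helper names is_well_formed,
tidy_command and load_html are made up for illustration, and the
option choice (-q, -asxhtml, -numeric, --force-output yes) is just one
reasonable combination.

```python
import subprocess
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if ElementTree can parse the markup as XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for problems such as "mismatched tag: line N, column M".
        return False


def tidy_command(path, tidy_exe="tidy"):
    """Build a command line asking tidy to write XHTML to stdout.

    Assumption: an executable called "tidy" is installed and on the PATH.
    -q          suppress non-essential output
    -asxhtml    convert the HTML to well-formed XHTML
    -numeric    emit numeric entities (ElementTree rejects &nbsp; etc.)
    --force-output yes   still produce output when tidy reports errors
    """
    return [tidy_exe, "-q", "-asxhtml", "-numeric",
            "--force-output", "yes", path]


def load_html(path):
    """Parse a file directly if possible, otherwise via tidy."""
    with open(path, "rb") as f:
        raw = f.read()
    if is_well_formed(raw):
        return ET.fromstring(raw)
    # tidy exits with 1 for warnings and 2 for errors, so the return
    # code is not treated as fatal here; we just try to parse the output.
    result = subprocess.run(tidy_command(path),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    return ET.fromstring(result.stdout)
```

Calling load_html on each file in the batch should then give an
element tree either way, although a file that is too badly broken for
tidy to repair can still raise ET.ParseError. Note that tidy's XHTML
output puts elements in the XHTML namespace, so tag names come back as
{http://www.w3.org/1999/xhtml}body rather than plain body.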

http://tidy.sourceforge.net/

It may not be Python but it's fast and effective!
And there is a Python wrapper:

http://utidylib.berlios.de/

although I've never used it.

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/ 



