HTMLParser.HTMLParseError: EOF in middle of construct

Stefan Behnel stefan.behnel-n05pAM at
Tue Jun 19 08:22:44 CEST 2007

Sergio Monteiro Basto wrote:
> Can someone explain to me what is wrong with this site?
> python > test
> HTMLParser.HTMLParseError: EOF in middle of construct, at line 1173,
> column 1
> Line 1173 of the test file looks perfectly normal.
> I would like to know what I have to clean up before parsing the HTML page.
> I am sending the Python code as an attachment.

You don't want to do these things with HTMLParser. lxml is much easier to use
and supports broken HTML (as in the page you're parsing).

Note that there is an SVN branch of lxml that comes with an HTML package
(lxml.html) that provides a "clean()" function. Just parse the page with the
HTML parser provided by the package (a few lines), then call the clean()
function on it, passing parameters that say what you want to get rid of
(scripts and the like).
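
A minimal sketch of the parse-then-clean idea, assuming lxml is installed. It
does not use the clean() function from the branch, whose exact parameters are
not shown here. It uses etree.strip_elements() as a stand-in to drop scripts
and stylesheets. BROKEN_HTML and parse_and_strip() are made-up names for
illustration only.

```python
import lxml.html
from lxml import etree

# Deliberately sloppy markup: unclosed <p> and <b>, no closing </html>.
BROKEN_HTML = """
<html><body>
<p>First paragraph
<script>alert("x")</script>
<p>Second <b>paragraph
<style>p { color: red }</style>
</body>
"""


def parse_and_strip(markup):
    """Parse possibly broken HTML and remove <script>/<style> elements."""
    # lxml's HTML parser recovers from unclosed tags instead of raising.
    doc = lxml.html.fromstring(markup)
    # with_tail=False keeps the text that follows each removed element.
    etree.strip_elements(doc, "script", "style", with_tail=False)
    return doc


doc = parse_and_strip(BROKEN_HTML)
print(doc.text_content())
```

Stripping elements by tag name is cruder than a dedicated cleaner (it leaves
event-handler attributes such as onclick in place, for example), so treat it as
a starting point rather than a sanitizer.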

The docs:

The SVN branch:

You seem to be on Linux, so compiling lxml should be simple enough:

Have fun,
