HTML File Parsing

Stefan Behnel stefan_ml at
Tue Oct 28 21:18:33 CET 2008

Felipe De Bene wrote:
> I'm having problems parsing an HTML file with the following syntax :
> <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
>     <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
>     <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
> BGCOLOR='#c0c0c0'>Date</TH>
> and so on....
> whenever I feed the parser with such file I get the error :
> HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at
> line 515, column 45

Your HTML page is not valid HTML, i.e. it is broken: the end tag in the error
message carries an attribute, which end tags must not have. Python's HTMLParser
is not made for parsing broken HTML. However, you can use the parser of
lxml.html to fix up your HTML for you.
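
A minimal sketch of that approach, assuming the third-party lxml package is
installed. The input string is a hypothetical fragment modelled on the quoted
table, with the bad end tag from the error message put in by hand. It is not
the original file.

```python
# Assumes the third-party lxml package is installed (pip install lxml).
import lxml.html

# Hypothetical fragment modelled on the quoted table. The first end tag
# wrongly carries an attribute, as in the reported error message.
broken = (
    "<TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>"
    "<TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH BGCOLOR='#c0c0c0'>"
    "<TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH>"
    "<TH width='7%' BGCOLOR='#c0c0c0'>Date</TH>"
    "</TABLE>"
)

# lxml.html parses in recovery mode, so the malformed end tag does not
# raise an exception; the parser builds a tree from what it can recover.
root = lxml.html.fromstring(broken)

# Tag names are lowercased by the HTML parser, so look for "th".
headers = [th.text_content().strip() for th in root.iter("th")]
print(headers)

# Serialize the repaired tree back to markup.
cleaned = lxml.html.tostring(root, encoding="unicode")
print(cleaned)
```

The serialized output can then be processed further or written back to disk.
How a given malformed construct is repaired depends on the underlying libxml2
version, so it is worth checking the result on a sample of the real file.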


More information about the Python-list mailing list