HTML File Parsing
Stefan Behnel
stefan_ml at behnel.de
Tue Oct 28 16:18:33 EDT 2008
Felipe De Bene wrote:
> I'm having problems parsing an HTML file with the following syntax :
>
> <TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
> <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
> <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
> BGCOLOR='#c0c0c0'>Date</TH>
> and so on....
>
> whenever I feed the parser with such file I get the error :
>
> HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at
> line 515, column 45
Your HTML page is not valid HTML, i.e. it is broken: the end tag in the error
message carries an attribute, which HTML does not allow. Python's HTMLParser is
not made for parsing broken HTML. However, you can use the parser in lxml.html
to fix up your HTML for you.
http://codespeak.net/lxml/
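
A minimal sketch of what that could look like, assuming lxml is installed. The
sample string is made up for illustration: it reuses the header cells from the
quoted table and adds the kind of malformed end tag shown in the error message.
The <TR> wrapper and the variable names are assumptions, not taken from the
original file.

```python
import lxml.html

# Hypothetical sample modelled on the quoted table; the first end tag
# carries an attribute, like the one in the reported error message.
broken = (
    "<TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>"
    "<TR><TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH BGCOLOR='#c0c0c0'>"
    "<TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH>"
    "<TH width='7%' BGCOLOR='#c0c0c0'>Date</TH></TR></TABLE>"
)

# lxml.html uses a recovering parser, so it builds a tree instead of
# raising an exception on the malformed end tag.
root = lxml.html.fromstring(broken)

# Tag names are lower-cased in the parsed tree.
headers = [th.text_content() for th in root.iter("th")]
print(headers)

# Serialize the repaired tree back to well-formed HTML.
print(lxml.html.tostring(root, pretty_print=True).decode())
```

For a file on disk, lxml.html.parse(filename) returns a whole document tree
that can be searched the same way.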
Stefan