HTML File Parsing

Felipe De Bene ttboy86 at
Tue Oct 28 20:58:56 CET 2008

I'm having problems parsing an HTML file with the following syntax:

<TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
    <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
    <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
and so on....

Whenever I feed the parser such a file, I get this error:

Traceback (most recent call last):
  File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\", line 91, in <module>
  File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\", line 16, in parse
  File "C:\Python25\lib\", line 110, in feed
  File "C:\Python25\lib\", line 152, in goahead
    k = self.parse_endtag(i)
  File "C:\Python25\lib\", line 316, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\Python25\lib\", line 117, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at line 515, column 45

Googling around, I kept finding the same solution to a similar
situation:

but coding:

you can disable proper parsing by setting the CDATA_CONTENT_ELEMENTS
attribute on the parser instance, before you start parsing. by
default, it is set to
CDATA_CONTENT_ELEMENTS = ("script", "style")
setting it to an empty tuple disables HTML-compliant handling for
p = HTMLParser()

didn't solve my problem. Instead, I made a small modification to the
endtagfind pattern, which did solve it, as follows:
original: endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)?(.*)
my version: endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)
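
To show why an end tag such as the one in the traceback trips a strict
pattern, here is a minimal sketch. The endings of both patterns above
are cut off, so the two complete patterns below are assumptions made
for illustration only: one allows nothing but whitespace between the
tag name and '>', the other tolerates any stray text there. Neither is
claimed to be the exact pattern from the library or from my change.

```python
import re

# The end tag reported in the traceback.
bad_end_tag = "</TH BGCOLOR='#c0c0c0'>"

# Assumed strict form: only whitespace may follow the tag name.
whitespace_only = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')

# Assumed lenient form: any text up to the closing '>' is tolerated.
anything_allowed = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)[^>]*>')

strict_match = whitespace_only.match(bad_end_tag)
lenient_match = anything_allowed.match(bad_end_tag)

print(strict_match)              # None: the stray attribute blocks the match
print(lenient_match.group(1))    # TH
```

With the strict form the match fails, which is the kind of situation
in which a parser would report a bad end tag; the lenient form still
recovers the tag name.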

It worked fine for all the files I needed, and also for a different
file that I parse with the same library. It might sound stupid, but I
was wondering whether there's a better way of solving this problem
than modifying the standard library. Any clue?
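
One possible alternative to touching the library at all would be to
clean the markup before feeding it to the parser. The following is
only a sketch under assumptions: the helper name
strip_end_tag_attributes and its pattern are hypothetical, and it
assumes that no stray text inside an end tag contains a '>' character.

```python
import re

# Hypothetical pattern: an end tag followed by any stray text up to '>'.
_END_TAG = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)[^>]*>')

def strip_end_tag_attributes(html):
    """Rewrite every end tag as </name>, dropping text after the name."""
    return _END_TAG.sub(r'</\1>', html)

sample = "<TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH BGCOLOR='#c0c0c0'>"
cleaned = strip_end_tag_attributes(sample)
print(cleaned)  # <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
```

The cleaned string could then be passed to feed() in place of the raw
file contents. Start tags are left alone because the pattern only
matches text beginning with '</'. The substitution is blind to
context, though, so it would also rewrite anything that merely looks
like an end tag inside script or comment content.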

thx in advance,

More information about the Python-list mailing list