HTML File Parsing

Tue Oct 28 15:58:56 EDT 2008

I'm having problems parsing an HTML file that contains markup like this:

<TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
    <TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
    <TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH>
    <TH width='7%' BGCOLOR='#c0c0c0'>Date</TH>
and so on....

Whenever I feed such a file to the parser, I get this error:

Traceback (most recent call last):
  File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\parser.py", line 91, in <module>
    p.parse(thechange)
  File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\parser.py", line 16, in parse
    self.feed(s)
  File "C:\Python25\lib\HTMLParser.py", line 110, in feed
    self.goahead(0)
  File "C:\Python25\lib\HTMLParser.py", line 152, in goahead
    k = self.parse_endtag(i)
  File "C:\Python25\lib\HTMLParser.py", line 316, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\Python25\lib\HTMLParser.py", line 117, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at line 515, column 45

Googling around, I kept finding the same solution to a similar
situation, over and over again:
http://64.233.169.104/search?q=cache:zOmjwM_sGBcJ:coding.derkeiler.com/pdf/Archive/Python/comp.lang.python/2006-02/msg00026.pdf+CDATA_CONTENT_ELEMENTS&hl=pt-BR&ct=clnk&cd=5&gl=br&client=firefox-a

but applying what it suggests:

you can disable proper parsing by setting the CDATA_CONTENT_ELEMENTS
attribute on the parser instance, before you start parsing. by
default, it is set to

    CDATA_CONTENT_ELEMENTS = ("script", "style")

setting it to an empty tuple disables HTML-compliant handling for
these elements:

    p = HTMLParser()
    p.CDATA_CONTENT_ELEMENTS = ()
    p.feed(f.read())

didn't solve my problem. I then made a small modification to
HTMLParser.py itself, which did solve it:
original:   endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')
my version: endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)?(.*)\s*>')
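
To make the difference between the two patterns concrete, here is a
minimal sketch that uses only the re module. The names strict and
lenient are illustrative labels, not anything defined in
HTMLParser.py, and the test string is the end tag quoted in the error
message above.

```python
import re

# Pattern without the "(.*)" group: only a bare tag name may appear
# between "</" and ">".
strict = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')

# Pattern with the optional name followed by "(.*)": any text before
# the closing ">" is accepted and captured in group 2.
lenient = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)?(.*)\s*>')

bad_tag = "</TH BGCOLOR='#c0c0c0'>"   # end tag carrying an attribute
good_tag = "</TH>"                    # ordinary end tag

strict_bad = strict.match(bad_tag)    # no match for the malformed tag
strict_good = strict.match(good_tag)  # matches the ordinary tag
lenient_bad = lenient.match(bad_tag)  # matches; the attribute is swallowed
```

Since parse_endtag reports "bad end tag" when endtagfind fails to
match, the pattern without "(.*)" is the one that rejects this tag.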

It worked OK for all the files I needed, and also for a different file
that I parse with the same library. It might sound stupid, but I was
wondering whether there's a better way of solving this problem than
modifying the standard library. Any clue?
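
For reference, one possible alternative that leaves HTMLParser.py
untouched is to clean up the malformed end tags before the text ever
reaches the parser. This is only a sketch under assumptions: the
helper name strip_endtag_attributes and the sample string are
hypothetical, and the pattern assumes that no attribute value inside
an end tag contains a ">" character.

```python
import re

# Hypothetical helper pattern: an end tag whose name is followed by
# whitespace and extra text (for example, stray attributes).
_ENDTAG_WITH_ATTRS = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s+[^>]*>')


def strip_endtag_attributes(html):
    """Rewrite end tags such as </TH BGCOLOR='#c0c0c0'> as plain </TH>."""
    return _ENDTAG_WITH_ATTRS.sub(r'</\1>', html)


# Made-up sample in the style of the table header rows above.
sample = "<TH BGCOLOR='#c0c0c0'>Name</TH BGCOLOR='#c0c0c0'><TH>Date</TH>"
cleaned = strip_endtag_attributes(sample)
```

The parser would then be fed the cleaned text, e.g.
p.feed(strip_endtag_attributes(f.read())). Start tags and well-formed
end tags are left as they are, since the pattern requires "</" and at
least one whitespace character after the tag name.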

Thanks in advance,
Felipe.