HTML File Parsing
Felipe De Bene
ttboy86 at gmail.com
Tue Oct 28 15:58:56 EDT 2008
I'm having problems parsing an HTML file with the following syntax:
<TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
<TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
<TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH>
<TH width='7%' BGCOLOR='#c0c0c0'>Date</TH>
and so on...
Whenever I feed such a file to the parser, I get this error:
Traceback (most recent call last):
  File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\parser.py", line 91, in <module>
    p.parse(thechange)
  File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\parser.py", line 16, in parse
    self.feed(s)
  File "C:\Python25\lib\HTMLParser.py", line 110, in feed
    self.goahead(0)
  File "C:\Python25\lib\HTMLParser.py", line 152, in goahead
    k = self.parse_endtag(i)
  File "C:\Python25\lib\HTMLParser.py", line 316, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\Python25\lib\HTMLParser.py", line 117, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at line 515, column 45
Googling around, I keep finding the same solution to a similar
situation, over and over again:
http://64.233.169.104/search?q=cache:zOmjwM_sGBcJ:coding.derkeiler.com/pdf/Archive/Python/comp.lang.python/2006-02/msg00026.pdf+CDATA_CONTENT_ELEMENTS&hl=pt-BR&ct=clnk&cd=5&gl=br&client=firefox-a
but following its advice:
you can disable proper parsing by setting the CDATA_CONTENT_ELEMENTS
attribute on the parser instance, before you start parsing. by default,
it is set to

    CDATA_CONTENT_ELEMENTS = ("script", "style")

setting it to an empty tuple disables HTML-compliant handling for these
elements:

    p = HTMLParser()
    p.CDATA_CONTENT_ELEMENTS = ()
    p.feed(f.read())
didn't solve my problem. Instead, I made a small modification to
HTMLParser.py, which solved the problem, as follows:

original:   endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)?(.*)\s*>')
my version: endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')
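
To see how the two patterns quoted above treat the end tag from the traceback, a small check along these lines may help. It is only a sketch: the variable names and the accepts() helper are hypothetical, and the patterns are simply the two quoted expressions with their wrapped lines rejoined.

```python
import re

# The two patterns quoted in the post, rejoined from their wrapped lines.
# The variable names are hypothetical, chosen only for this check.
with_trailing_group = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)?(.*)\s*>')
name_only = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')

bad_end_tag = "</TH BGCOLOR='#c0c0c0'>"  # copied from the traceback
plain_end_tag = "</TH>"


def accepts(pattern, text):
    """Return True if the pattern matches at the start of text."""
    return pattern.match(text) is not None


for label, pattern in (("with trailing group", with_trailing_group),
                       ("name only", name_only)):
    print(label, accepts(pattern, plain_end_tag), accepts(pattern, bad_end_tag))
```

Both patterns accept a plain </TH>, but only the one with the extra (.*) group accepts an end tag that carries attributes, so it is worth double-checking which of the two is actually in the installed HTMLParser.py.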
It worked OK for all the files I needed, and also for a different file
that I parse with the same library. I know it might sound stupid, but
I was wondering if there's a better way of solving this problem
than modifying the standard library. Any clue?
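
One possible way to leave HTMLParser.py untouched is to clean the malformed end tags before feeding the text to the parser. The sketch below is an assumption-laden illustration, not a tested fix for the files in question: strip_end_tag_attributes, TagCollector and the sample string are all hypothetical, and the import uses the Python 3 module name (html.parser) rather than the Python 2.5 HTMLParser module from the traceback.

```python
import re
from html.parser import HTMLParser  # Python 3 name; Python 2.5 uses "from HTMLParser import HTMLParser"

# Hypothetical helper pattern: an end tag name followed by anything up to ">".
_END_TAG_WITH_EXTRAS = re.compile(r'</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)[^>]*>')


def strip_end_tag_attributes(html):
    """Rewrite end tags such as </TH BGCOLOR='#c0c0c0'> to plain </TH>."""
    return _END_TAG_WITH_EXTRAS.sub(r'</\1>', html)


class TagCollector(HTMLParser):
    """Hypothetical parser subclass that records the end tags it sees."""

    def __init__(self):
        super().__init__()
        self.end_tags = []

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


# Made-up sample; the post does not show the offending line 515 itself.
sample = "<TH BGCOLOR='#c0c0c0'>User ID</TH BGCOLOR='#c0c0c0'><TH>Name</TH>"

cleaned = strip_end_tag_attributes(sample)
p = TagCollector()
p.feed(cleaned)
p.close()
print(cleaned)
print(p.end_tags)
```

This is a plain regex substitution, so it would also rewrite anything that merely looks like an end tag inside script or comment content, and it assumes no ">" appears inside the stray attributes.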
thx in advance,
Felipe.