SGMLParseError
Jay Parlar
jparlar at home.com
Wed Aug 22 21:18:02 EDT 2001
While trying to get HTMLParser to work, I occasionally run into the following problem:
>>> import urllib
>>> from formatter import AbstractFormatter, DumbWriter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter()))
>>> parser.feed(urllib.urlopen('http://cbc.ca').read())
CBC.CA Wednesday, Aug 22, 2001 nmweb02 shop[1] · help[2] · contact[3]
· search[4] (image)[5] Email News Digest[6] | Audio[7] | Video[8] |
CBC Radio Newscast[9] | CBC Newsworld Newscast[10]
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "c:\program files\python21\lib\sgmllib.py", line 91, in feed
    self.goahead(0)
  File "c:\program files\python21\lib\sgmllib.py", line 158, in goahead
    k = self.parse_declaration(i)
  File "c:\program files\python21\lib\sgmllib.py", line 238, in parse_declaration
    raise SGMLParseError(
SGMLParseError: unexpected char in declaration: '<'
It doesn't happen with every page (in fact, I have code which runs HTMLParser on over 400 separate pages, and
only five of the pages cause this), but I really can't have it happening at all.
I've checked the list archives, and haven't found any solutions to this problem. Is there anything I can do besides
catching the error? I'd really like some solution other than ignoring the pages that create this type of error.
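One possible workaround, sketched here only as an illustration: the message "unexpected char in declaration: '<'" suggests that the failing pages contain a "<!" declaration that is never closed with ">" before the next tag starts. Assuming that is the cause (it has not been confirmed for the pages in question), the unterminated declaration could be stripped from the page text before it is fed to the parser. The function name strip_unterminated_declarations and the regular expression are hypothetical and are not part of sgmllib or htmllib.

```python
import re

# Hypothetical helper, not part of sgmllib/htmllib.
# Matches "<!" (but not the comment opener "<!--") followed by any run of
# characters other than "<" and ">", and only when the next character is "<".
# In other words: a declaration that was never closed with ">".
_UNTERMINATED_DECL = re.compile(r'<!(?!--)[^<>]*(?=<)')


def strip_unterminated_declarations(html):
    """Return html with unterminated "<!..." declarations removed.

    Well-formed declarations such as "<!DOCTYPE html>" and comments
    are left untouched.
    """
    return _UNTERMINATED_DECL.sub('', html)


if __name__ == '__main__':
    broken = '<!DOCTYPE html <html><body>hello</body></html>'
    print(strip_unterminated_declarations(broken))
```

With a helper like this, the call in the session above would become parser.feed(strip_unterminated_declarations(page)), where page is the downloaded text. This only addresses the one failure mode assumed here; pages that fail for other reasons would still need the error to be caught.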
I've also been trying to use MSHTML as my parser, but that's giving me a whole array of problems in itself. The people
on the Microsoft group I've been going to don't seem to be nearly as helpful as the Python people are :)
Jay Parlar
----------------------------------------------------------------
Software Engineering III
McMaster University
Hamilton, Ontario, Canada
"Though there are many paths
At the foot of the mountain
All those who reach the top
See the same moon."