html parser, unexpected '<' char in declaration
Sakcee
sakcee at gmail.com
Mon Feb 20 18:01:53 EST 2006
>>> html = '''<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
... <head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
... </body></html>'''
>>> import htmllib
>>> import formatter
>>> parser=htmllib.HTMLParser(formatter.NullFormatter())
>>> parser.feed(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
    self.goahead(0)
  File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
    k = self.parse_declaration(i)
  File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
    self.error(
  File "/usr/lib/python2.4/htmllib.py", line 40, in error
    raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration
The error is caused by the unclosed DOCTYPE declaration: the parser
reaches the '<' of <head> before it finds the closing '>'.
What is the best way to handle this kind of document? Should I use a
regex to check for and strip the declaration, or does HTMLParser offer
something? Can I override the default sgmllib behaviour?
I have to work with htmllib because existing modules depend on it.
Thanks
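
For reference, the regex route asked about above could look roughly like
the sketch below. It is only an illustration, not something htmllib
provides: the names strip_doctype and _DOCTYPE_RE are made up here, and
the pattern assumes the DOCTYPE has no internal subset ("[...]") and no
'<' or '>' inside its quoted strings.

```python
import re

# Hypothetical helper, not part of htmllib or sgmllib.
# Matches "<!DOCTYPE", then everything up to (but not including) the
# next "<" or ">", plus the closing ">" when one is actually present.
_DOCTYPE_RE = re.compile(r'<!DOCTYPE[^<>]*>?', re.IGNORECASE)

def strip_doctype(html):
    """Return html with any DOCTYPE declaration, closed or not, removed."""
    return _DOCTYPE_RE.sub('', html)

# The document from the post, with its line breaks written as "\n".
html = ('<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
        '<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah\n'
        '</body></html>')
cleaned = strip_doctype(html)
```

The cleaned string no longer contains a declaration, so it can be passed
to parser.feed() in place of the original. Because the pattern stops at
the next '<', it should not swallow the markup that follows a DOCTYPE
missing its '>'.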